Chapter 11

Cluster Analysis

Maybe Hamlet is right. We could be bounded in a nutshell, but counting ourselves kings of infinite space. — Stephen Hawking

11.1 INTRODUCTION

Cluster analysis represents a set of very useful exploratory techniques that can be applied whenever we intend to verify the existence of similar behavior between observations (individuals, companies, municipalities, countries, among other examples) in relation to certain variables, and there is the intention of creating groups, or clusters, in which internal homogeneity prevails. In this regard, this set of techniques has as its main objective to allocate observations to a relatively small number of clusters that are internally homogeneous and heterogeneous between themselves, and that represent the joint behavior of the observations based on certain variables. That is, the observations of a certain group must be relatively similar to one another, in relation to the variables included in the analysis, and significantly different from the observations found in other groups.

Clustering techniques are considered exploratory, or interdependence, techniques, since their application does not have a predictive nature for observations not initially present in the sample. Moreover, the inclusion of new observations into the dataset makes it necessary to reapply the modeling, so that, possibly, new clusters can be generated. Besides, the inclusion of a new variable can also generate a complete rearrangement of the observations in the groups.

Researchers can choose to develop a cluster analysis when their main goal is to sort and allocate observations to groups and, from then on, to analyze what the ideal number of clusters formed is. Or they can, a priori, define the number of groups they wish to create, based on certain criteria, and verify how the sorting and allocation of observations behave in that specific number of groups. Regardless of the objective, clustering will continue being exploratory. If a researcher aims to use a technique to, in fact, confirm the creation of groups and to make the analysis predictive, he can use techniques such as, for example, discriminant analysis or multinomial logistic regression.

Elaborating a cluster analysis does not require vast knowledge of matrix algebra or statistics, differently from techniques such as factor analysis and correspondence analysis. The researcher interested in applying a cluster analysis needs to, starting from the definition of the research objectives, choose a certain distance or similarity measure that will be the basis for considering observations closer to or farther from one another, and a certain agglomeration schedule, to be chosen between hierarchical and nonhierarchical methods. From then on, he will be able to analyze, interpret, and compare the outcomes.

It is important to highlight that the outcomes obtained through hierarchical and nonhierarchical agglomeration schedules can be compared and, in this regard, the researcher is free to develop the technique using one method or the other, and to reapply it if he deems it necessary. While hierarchical schedules allow us to identify the sorting and allocation of observations, offering possibilities for researchers to study, assess, and decide on the number of clusters formed, in nonhierarchical schedules we start with a known number of clusters and, from then on, begin allocating the observations to these clusters, with a later evaluation of the representativeness of each variable in creating them. Therefore, the result of one method can serve as input to carry out the other, making the analysis cyclical.
Fig. 11.1 shows the logic from which a cluster analysis can be elaborated. When choosing the distance or similarity measure and the agglomeration schedule, we must take some aspects into consideration, such as the previously desired number of clusters, possibly defined based on some resource allocation criteria, as well as certain constraints that may lead the researcher to choose a specific solution.


FIG. 11.1 Logic for elaborating a cluster analysis.

According to Bussab et al. (1990), different criteria regarding distance measures and agglomeration schedules may lead to different cluster formations, and the homogeneity desired by the researcher fundamentally depends on the objectives set in the research.

Imagine that a researcher is interested in studying the interdependence between individuals living in a certain municipality based only on two metric variables (age, in years, and average family income, in R$). His main goal is to assess the effectiveness of social programs aimed at providing health care and then, based on these variables, to propose a still unknown number of new programs aimed at homogeneous groups of people. After collecting the data, the researcher constructed a scatter plot, as shown in Fig. 11.2. Based on the chart seen in Fig. 11.2, the researcher identified four clusters and highlighted them in a new chart (Fig. 11.3). From the creation of these clusters, the researcher decided to develop an analysis of the behavior of the observations in each group or, more precisely, of the existing variability within the clusters and between them, so that he could clearly and consciously base his decision as regards the allocation of individuals to these four new social programs. In order to illustrate this issue, the researcher constructed the chart found in Fig. 11.4.

FIG. 11.2 Scatter plot with individuals’ Income and Age.


FIG. 11.3 Highlighting the creation of four clusters.

FIG. 11.4 Illustrating the variability within the clusters and between them.

Based on this chart, the researcher was able to notice that the groups formed showed a lot of internal homogeneity, with each individual being closer to the other individuals in the same group than to individuals in other groups. This is the core of cluster analysis. If the number of social programs to be provided for the population (the number of clusters) had already been given to the researcher, due to budgetary, legal, or political constraints, clustering could still be used, solely, to determine the allocation of the municipality's individuals to that number of programs (groups).


FIG. 11.5 Rearranging the clusters due to the presence of elderly billionaires.

Having concluded the research and allocated the individuals to the different social health care programs, the following year the researcher decided to carry out the same research with individuals from the same municipality. However, in the meantime, a group of elderly billionaires had decided to move to that city and, when he constructed the new scatter plot, the researcher realized that those four clusters, clearly formed the previous year, did not exist anymore, since they fused when the billionaires were included. The new scatter plot can be seen in Fig. 11.5.

This new situation exemplifies the importance of always reapplying the cluster analysis whenever new observations (and also new variables) are included, which deprives the technique of any predictive power, as we have already discussed. Moreover, this example shows that, before elaborating any cluster analysis, it is advisable for the researcher to study the data behavior and to check for the existence of discrepant observations in relation to certain variables, since the creation of clusters is very sensitive to the presence of outliers. Excluding or retaining outliers in the dataset, however, will depend on the research objectives and on the type of data researchers have: if certain observations represent anomalies in terms of variable values, when compared to the other observations, and end up forming small, insignificant, or even individual clusters, they can, in fact, be excluded. On the other hand, if these observations represent one or more relevant groups, even if they are different from the others, they must be considered in the analysis and, whenever the technique is reapplied, they can be separated so that other segmentations can be better structured in new groups, formed with higher internal homogeneity. We would like to emphasize that cluster analysis methods are considered static procedures, since the inclusion of new observations or variables may change the clusters, thus making it mandatory to develop a new analysis.

In this example, we realized that the original variables from which the groups are established are metric, since the clustering started from the study of the distance behavior (dissimilarity measures) between the observations. In some cases, as we will study throughout this chapter, cluster analyses can be elaborated from the similarity behavior (similarity measures) between observations described by binary variables. However, it is common for researchers to apply an incorrect, arbitrary weighting procedure to qualitative variables, as, for example, variables on the Likert scale, and, from then on, to apply a cluster analysis. This is a major error, since there are exploratory techniques meant exclusively for the study of the behavior of qualitative variables, as, for example, correspondence analysis.

Historically speaking, even though many distance and similarity measures date back to the end of the 19th century and the beginning of the 20th century, cluster analyses, as a better structured set of techniques, began in the field of Anthropology with Driver and Kroeber (1932), and in Psychology with Zubin (1938a,b) and Tryon (1939),


as discussed by Reis (2001) and Fávero et al. (2009). With the acknowledgment that observation clustering and classification procedures are scientific methods, together with astonishing technological developments, mainly verified after the 1960s, cluster analyses started being used more frequently after Sokal and Sneath's (1963) relevant work was published, in which procedures are carried out to compare the biological similarities of organisms with similar characteristics and the respective species.

Currently, cluster analysis offers several application possibilities in the fields of consumer behavior, market segmentation, strategy, political science, economics, finance, accounting, actuarial science, engineering, logistics, computer science, education, medicine, biology, genetics, biostatistics, psychology, anthropology, demography, geography, ecology, climatology, geology, archeology, criminology and forensics, among others.

In this chapter, we will discuss cluster analysis techniques, aiming at: (1) introducing the concepts; (2) presenting the step by step of the modeling, in an algebraic and practical way; (3) interpreting the results obtained; and (4) applying the technique in SPSS and in Stata. Following the logic proposed in the book, first we will present the algebraic solution of an example jointly with the presentation of the concepts. Only after the introduction of the concepts will the procedures for elaborating the techniques in SPSS and Stata be presented.

11.2 CLUSTER ANALYSIS

There are many procedures for elaborating a cluster analysis, since there are different distance or similarity measures for metric or binary variables, respectively. Besides, after defining the distance or similarity measure, the researcher still needs to determine, among several possibilities, the observation clustering method, from certain hierarchical or nonhierarchical criteria. Therefore, when one wishes to group observations into internally homogeneous clusters, what initially seems trivial can become quite complex, because there are multiple combinations between different distance or similarity measures and clustering methods. Hence, based on the underlying theory and on his research objectives, as well as on his experience and intuition, it is extremely important for the researcher to define the criteria from which the observations will be allocated to each one of the groups. In the following sections, we will discuss the theoretical development of the technique, along with a practical example. In Sections 11.2.1 and 11.2.2, the concepts of distance and similarity measures and of clustering methods are presented and discussed, respectively, always followed by the algebraic solutions developed from a dataset.

11.2.1 Defining Distance or Similarity Measures in Cluster Analysis

As we have already discussed, the first phase in elaborating a cluster analysis consists of defining the distance (dissimilarity) or similarity measure that will be the basis for each observation to be allocated to a certain group. Distance measures are frequently used when the variables in the dataset are essentially metric since, the greater the differences between the variable values of two observations, the smaller the similarity between them or, in other words, the higher the dissimilarity. On the other hand, similarity measures are often used when the variables are binary, and what most interests us is the frequency of converging answer pairs 1-1 or 0-0 of two observations. In this case, the greater the frequency of converging pairs, the higher the similarity between the observations. An exception to this rule is Pearson's correlation coefficient between two observations, which is calculated from metric variables but has similarity characteristics, as we will see in the following section. We will study the dissimilarity measures for metric variables in Section 11.2.1.1 and, in Section 11.2.1.2, we will discuss the similarity measures for binary variables.

11.2.1.1 Distance (Dissimilarity) Measures Between Observations for Metric Variables

As a hypothetical situation, imagine that we intend to calculate the distance between two observations i (i = 1, 2) from a dataset that has three metric variables (X1i, X2i, X3i), with values in the same unit of measure. These data can be found in Table 11.1. It is possible to illustrate the configuration of both observations in a three-dimensional space from these data, since we have exactly three variables. Fig. 11.6 shows the relative position of each observation, emphasizing the distance between them (d12). Distance d12, which is a dissimilarity measure, can be easily calculated by using, for instance, its projection over the horizontal plane formed by axes X1 and X2, called distance d′12, as shown in Fig. 11.7.


TABLE 11.1 Part of a Dataset With Two Observations and Three Metric Variables

Observation i    X1i    X2i    X3i
1                3.7    2.7    9.1
2                7.8    8.0    1.5

FIG. 11.6 Three-dimensional scatter plot for the hypothetical situation with two observations and three variables.

Thus, based on the well-known Pythagorean distance formula for right-angled triangles, we can determine d12 through the following expression:

d_{12} = \sqrt{(d'_{12})^2 + (X_{31} - X_{32})^2}    (11.1)

where |X_{31} - X_{32}| is the distance between the vertical projections (axis X3) of points 1 and 2. However, distance d′12 is unknown to us, so, once again, we need to use the Pythagorean formula, now using the distances between the projections of points 1 and 2 over the other two axes (X1 and X2), as shown in Fig. 11.8. Thus, we can say that:

d'_{12} = \sqrt{(X_{11} - X_{12})^2 + (X_{21} - X_{22})^2}    (11.2)

and, substituting (11.2) in (11.1), we have:

d_{12} = \sqrt{(X_{11} - X_{12})^2 + (X_{21} - X_{22})^2 + (X_{31} - X_{32})^2}    (11.3)

which is the expression of the distance (dissimilarity measure) between points 1 and 2, also known as the Euclidean distance formula.


FIG. 11.7 Three-dimensional chart highlighting the projection of d12 over the horizontal plane.

FIG. 11.8 Projection of the points over the plane formed by X1 and X2 with emphasis on d′12.


Therefore, for the data in our example, we have:

d_{12} = \sqrt{(3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2} = 10.132

whose unit of measure is the same as for the original variables in the dataset. It is important to highlight that, if the variables do not have the same unit of measure, a data standardization procedure will have to be carried out previously, as we will discuss later. We can generalize this problem for a situation in which the dataset has n observations and, for each observation i (i = 1, ..., n), values corresponding to each one of the j (j = 1, ..., k) metric variables X, as shown in Table 11.2. So, Expression (11.4), based on Expression (11.3), presents the general definition of the Euclidean distance between any two observations p and q:

d_{pq} = \sqrt{(X_{1p} - X_{1q})^2 + (X_{2p} - X_{2q})^2 + ... + (X_{kp} - X_{kq})^2} = \sqrt{\sum_{j=1}^{k} (X_{jp} - X_{jq})^2}    (11.4)
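To make the arithmetic concrete, the short sketch below (in Python with NumPy and SciPy, which are not used in the chapter itself and are only an assumption here) reproduces this calculation for the two observations of Table 11.1:

```python
import numpy as np
from scipy.spatial.distance import euclidean

# Observations 1 and 2 from Table 11.1 (variables X1, X2, X3)
obs1 = np.array([3.7, 2.7, 9.1])
obs2 = np.array([7.8, 8.0, 1.5])

# Expression (11.4): square root of the sum of squared differences
d12 = np.sqrt(np.sum((obs1 - obs2) ** 2))
print(round(d12, 3))                     # 10.132
print(round(euclidean(obs1, obs2), 3))   # same value via SciPy
```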

Although the Euclidean distance is the one most commonly used in cluster analyses, there are other dissimilarity measures, and the use of each one of them will depend on the researcher's assumptions and objectives. Next, we discuss some of these other dissimilarity measures:

• Squared Euclidean distance: instead of the Euclidean distance, it can be used when the variables show a small dispersion in value, and the use of the squared Euclidean distance makes it easier to interpret the outputs of the analysis and the allocation of the observations to the groups. Its expression is given by:

d_{pq} = (X_{1p} - X_{1q})^2 + (X_{2p} - X_{2q})^2 + ... + (X_{kp} - X_{kq})^2 = \sum_{j=1}^{k} (X_{jp} - X_{jq})^2    (11.5)

• Minkowski distance: it is the most general dissimilarity measure expression, from which the others derive. It is given by:

d_{pq} = \left[ \sum_{j=1}^{k} |X_{jp} - X_{jq}|^m \right]^{1/m}    (11.6)

where m takes on positive integer values (m = 1, 2, ...). We can see that the Euclidean distance is a particular case of the Minkowski distance, when m = 2.

TABLE 11.2 General Model of a Dataset for Elaborating the Cluster Analysis

Observation i    X1i    X2i    ...    Xki
1                X11    X21    ...    Xk1
2                X12    X22    ...    Xk2
...              ...    ...    ...    ...
p                X1p    X2p    ...    Xkp
...              ...    ...    ...    ...
q                X1q    X2q    ...    Xkq
...              ...    ...    ...    ...
n                X1n    X2n    ...    Xkn


• Manhattan distance: also referred to as the absolute or city block distance, it does not consider the triangular geometry that is inherent to Pythagoras' initial expression and only considers the differences between the values of each variable. Its expression, also a particular case of the Minkowski distance when m = 1, is given by:

d_{pq} = \sum_{j=1}^{k} |X_{jp} - X_{jq}|    (11.7)

• Chebyshev distance: also referred to as the infinite or maximum distance, it is a particular case of the Manhattan distance because it only considers, for two observations, the maximum difference among all the j variables being studied. Its expression is given by:

d_{pq} = \max_{j} |X_{jp} - X_{jq}|    (11.8)

It is a particular case of the Minkowski distance as well, when m = ∞.

• Canberra distance: used for the cases in which the variables only have positive values, it assumes values between 0 and j (the number of variables). Its expression is given by:

d_{pq} = \sum_{j=1}^{k} \frac{|X_{jp} - X_{jq}|}{X_{jp} + X_{jq}}    (11.9)

Whenever there are metric variables, the researcher can also use Pearson's correlation, which, even though it is not a dissimilarity measure (in fact, it is a similarity measure), can provide important information when the aim is to group rows of the dataset. Pearson's correlation expression, between the values of any two observations p and q, based on Expression (4.11) presented in Chapter 4, can be written as follows:

r_{pq} = \frac{\sum_{j=1}^{k} (X_{jp} - \bar{X}_p)(X_{jq} - \bar{X}_q)}{\sqrt{\sum_{j=1}^{k} (X_{jp} - \bar{X}_p)^2} \cdot \sqrt{\sum_{j=1}^{k} (X_{jq} - \bar{X}_q)^2}}    (11.10)

where \bar{X}_p and \bar{X}_q represent the mean of all variable values for observations p and q, respectively, that is, the mean of each one of the rows of the dataset. Therefore, we can see that we are dealing with a coefficient of correlation between rows, and not between columns (variables), which is the most common situation in data analysis; its values vary between -1 and 1. Pearson's correlation coefficient can be used as a similarity measure between the rows of the dataset in analyses that include time series, for example, that is, cases in which the observations represent periods. In this case, the researcher may intend to study the correlations between different periods, to investigate, for instance, a possible recurrence of behavior in the same row for the set of variables, which may cause certain periods, not necessarily subsequent ones, to be grouped by similarity of behavior.

Going back to the data presented in Table 11.1, we can calculate the different distance measures between observations 1 and 2, given by Expressions (11.4)–(11.9), as well as the correlational similarity measure, given by Expression (11.10). Table 11.3 shows these calculations and the respective results. Based on the results shown in Table 11.3, we can see that different measures produce different results, which may cause the observations to be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis, as discussed by Vicini and Souza (2005) and Malhotra (2012). Therefore, it is essential for the researcher to always underpin his choice and to bear in mind the reasons why he decided to use a certain measure instead of others. Simply using more than one measure when analyzing the same dataset can support this decision, since, in this case, the results can be compared. This becomes really clear when we include a third observation in the analysis, as shown in Table 11.4. While the Euclidean distance suggests that the most similar observations (the shortest distance) are 2 and 3, when we use the Chebyshev distance, observations 1 and 3 are the most similar. Table 11.5 shows these distances for each pair of observations, highlighting the smallest value of each distance.


TABLE 11.3 Distance and Correlational Similarity Measures Between Observations 1 and 2

Observation i    X1i    X2i    X3i    Mean
1                3.7    2.7    9.1    5.167
2                7.8    8.0    1.5    5.767

Euclidean distance:
d_{12} = \sqrt{(3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2} = 10.132

Squared Euclidean distance:
d_{12} = (3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2 = 102.660

Manhattan distance:
d_{12} = |3.7 - 7.8| + |2.7 - 8.0| + |9.1 - 1.5| = 17.000

Chebyshev distance:
d_{12} = |9.1 - 1.5| = 7.600

Canberra distance:
d_{12} = |3.7 - 7.8|/(3.7 + 7.8) + |2.7 - 8.0|/(2.7 + 8.0) + |9.1 - 1.5|/(9.1 + 1.5) = 1.569

Pearson's correlation (similarity):
r_{12} = [(3.7 - 5.167)(7.8 - 5.767) + (2.7 - 5.167)(8.0 - 5.767) + (9.1 - 5.167)(1.5 - 5.767)] / [\sqrt{(3.7 - 5.167)^2 + (2.7 - 5.167)^2 + (9.1 - 5.167)^2} \cdot \sqrt{(7.8 - 5.767)^2 + (8.0 - 5.767)^2 + (1.5 - 5.767)^2}] = -0.993

TABLE 11.4 Part of the Dataset With Three Observations and Three Metric Variables

Observation i    X1i    X2i    X3i
1                3.7    2.7    9.1
2                7.8    8.0    1.5
3                8.9    1.0    2.7

TABLE 11.5 Euclidean and Chebyshev Distances Between the Pairs of Observations Seen in Table 11.4

Distance     Observations 1 and 2    Observations 1 and 3    Observations 2 and 3
Euclidean    d12 = 10.132            d13 = 8.420             d23 = 7.187 (smallest)
Chebyshev    d12 = 7.600             d13 = 6.400 (smallest)  d23 = 7.000
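The measures compared in Tables 11.3 and 11.5 can be reproduced with SciPy's pairwise-distance routine; the sketch below is only an illustration under the assumption that NumPy and SciPy are available (the metric names are SciPy's, e.g., "cityblock" for the Manhattan distance):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Observations 1-3 from Table 11.4 (variables X1, X2, X3)
X = np.array([[3.7, 2.7, 9.1],
              [7.8, 8.0, 1.5],
              [8.9, 1.0, 2.7]])

# Each call returns the distances for the pairs (1,2), (1,3) and (2,3)
for metric in ("euclidean", "sqeuclidean", "cityblock", "chebyshev", "canberra"):
    print(metric, np.round(pdist(X, metric=metric), 3))
# The euclidean and chebyshev rows match Table 11.5

# Pearson's correlation between ROWS (observations), Expression (11.10)
print(np.round(np.corrcoef(X), 3))
```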

Hence, in a certain cluster schedule, and only due to the dissimilarity measure chosen, we would have different initial clusters. Besides deciding which distance measure to choose, the researcher also has to verify whether the data need to be treated previously. So far, in the examples we have discussed, we were careful to choose metric variables with values in the same unit of measure (as, for example, students' grades in Math, Physics, and Chemistry, which vary from 0 to 10). However, if the variables are measured in different units (as, for example, income in R$, educational level in years of study, and number of children), the intensity of the distances between the observations may be arbitrarily influenced by the variables that present greater magnitude in their values, to the detriment of the others.


In these situations, the researcher must standardize the data, so that the arbitrary nature of the measurement units may be eliminated, making each variable have the same contribution to the distance measure considered. The Z-scores procedure is the most frequently used method to standardize variables. In it, for each observation i, the value of a new standardized variable ZXj is obtained by subtracting the mean of the corresponding original variable Xj from its observed value and, after that, dividing the result by the standard deviation, as presented in Expression (11.11):

ZX_{ji} = \frac{X_{ji} - \bar{X}_j}{s_j}    (11.11)

where \bar{X}_j and s_j represent the mean and the standard deviation of variable Xj. Hence, regardless of the magnitude of the values and of the type of measurement units of the original variables in a dataset, all the respective variables standardized by the Z-scores procedure will have a mean equal to 0 and a standard deviation equal to 1, which ensures that possible arbitrary effects of the measurement units on the distance between each pair of observations are eliminated. In addition, Z-scores have the advantage of not changing the distribution of the original variable. Therefore, if the original variables are in different units, the distance measure Expressions (11.4)–(11.9) must have the terms Xjp and Xjq replaced by ZXjp and ZXjq, respectively. Table 11.6 presents these expressions, based on the standardized variables. Even though Pearson's correlation is not a dissimilarity measure (in fact, it is a similarity measure), it is important to mention that its use also requires that the variables be standardized by using the Z-scores procedure in case they do not have the same measurement units. If the main goal were to group variables, which is the main goal of the following chapter (factor analysis), the standardization of variables through the Z-scores procedure would, in fact, be irrelevant, given that the analysis would consist in assessing the correlation between columns of the dataset. On the other hand, as the objective of this chapter is to group rows of the dataset, which represent the observations, the standardization of the variables is necessary for elaborating an accurate cluster analysis.
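As an illustration of Expression (11.11), the sketch below standardizes a small, hypothetical data matrix (the values are invented for the example; using the sample standard deviation, ddof=1, is an assumption of this sketch):

```python
import numpy as np

# Hypothetical variables in different units: income (R$), schooling (years), number of children
X = np.array([[2500.0, 12, 2],
              [8000.0, 16, 1],
              [1200.0,  8, 4],
              [4300.0, 11, 3]])

# Z-scores, Expression (11.11): subtract each column mean and divide by its standard deviation
ZX = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(ZX.round(3))
print(ZX.mean(axis=0).round(3))          # means are (approximately) 0
print(ZX.std(axis=0, ddof=1).round(3))   # standard deviations are 1
```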

TABLE 11.6 Distance Measure Expressions With Standardized Variables

Euclidean:           d_{pq} = \sqrt{\sum_{j=1}^{k} (ZX_{jp} - ZX_{jq})^2}
Squared Euclidean:   d_{pq} = \sum_{j=1}^{k} (ZX_{jp} - ZX_{jq})^2
Minkowski:           d_{pq} = [\sum_{j=1}^{k} |ZX_{jp} - ZX_{jq}|^m]^{1/m}
Manhattan:           d_{pq} = \sum_{j=1}^{k} |ZX_{jp} - ZX_{jq}|
Chebyshev:           d_{pq} = \max_j |ZX_{jp} - ZX_{jq}|
Canberra:            d_{pq} = \sum_{j=1}^{k} |ZX_{jp} - ZX_{jq}| / (ZX_{jp} + ZX_{jq})

11.2.1.2 Similarity Measures Between Observations for Binary Variables

Now, imagine that we intend to calculate the distance between two observations i (i = 1, 2) coming from a dataset that has seven variables (X1i, ..., X7i), however, all of them related to the presence or absence of characteristics. In this situation, it is common for the presence or absence of a certain characteristic to be represented by a binary variable, or dummy, which assumes value 1, in case the characteristic occurs, and 0 otherwise. These data can be found in Table 11.7. It is important to highlight that the use of binary variables does not generate the arbitrary weighting problems resulting from variable categories, contrary to what would happen if discrete values (1, 2, 3, ...) were assigned to each category of each qualitative variable. In this regard, if a certain qualitative variable has k categories, (k - 1) binary variables will be necessary to represent the presence or absence of each one of the categories. Thus, all the binary variables will be equal to 0 in case the reference category occurs.


TABLE 11.7 Part of the Dataset With Two Observations and Seven Binary Variables

Observation i    X1i    X2i    X3i    X4i    X5i    X6i    X7i
1                0      0      1      1      0      1      1
2                0      1      1      1      1      0      1

Therefore, by using Expression (11.4), we can calculate the squared Euclidean distance between observations 1 and 2, as follows:

d_{12} = \sum_{j=1}^{7} (X_{j1} - X_{j2})^2 = (0-0)^2 + (0-1)^2 + (1-1)^2 + (1-1)^2 + (0-1)^2 + (1-0)^2 + (1-1)^2 = 3

which represents the total number of variables with answer differences between observations 1 and 2. Therefore, for any two observations p and q, the greater the number of equal answers (0-0 or 1-1), the shorter the squared Euclidean distance between them will be, since:

(X_{jp} - X_{jq})^2 = \begin{cases} 0, & \text{if } X_{jp} = X_{jq} \\ 1, & \text{if } X_{jp} \neq X_{jq} \end{cases}    (11.12)
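A two-line check (Python/NumPy assumed, as before) confirms that, for binary data, the squared Euclidean distance of Expression (11.12) is simply the count of mismatched answers:

```python
import numpy as np

x1 = np.array([0, 0, 1, 1, 0, 1, 1])  # observation 1 of Table 11.7
x2 = np.array([0, 1, 1, 1, 1, 0, 1])  # observation 2 of Table 11.7

# Squared Euclidean distance = number of answer differences (here, 3)
print(int(np.sum((x1 - x2) ** 2)))
```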

As discussed by Johnson and Wichern (2007), each term of the distance represented by Expression (11.12) is considered to be a dissimilarity measure, since the greater the number of answer discrepancies, the greater the squared Euclidean distance. On the other hand, the calculation weights the pairs of answers 0-0 and 1-1 equally, without giving higher relative importance to the pair of answers 1-1 that, in many cases, is a stronger similarity indicator than the pair of answers 0-0. For example, when we group people, the fact that two of them eat lobster every day is stronger similarity evidence than the absence of this characteristic for both. Hence, many authors, aiming at defining similarity measures between observations, proposed the use of coefficients that take the similarity of the answers 1-1 and 0-0 into consideration, without these pairs necessarily having the same relative importance. In order to present these measures, it is necessary to construct an absolute frequency table of answers 0 and 1 for each pair of observations p and q, as shown in Table 11.8.

TABLE 11.8 Absolute Frequencies of Answers 0 and 1 for Two Observations p and q

                    Observation p
Observation q       1        0        Total
1                   a        b        a + b
0                   c        d        c + d
Total               a + c    b + d    a + b + c + d

Next, based on this table, we will discuss the main similarity measures, bearing in mind that the use of each one depends on the researcher's assumptions and objectives.

• Simple matching coefficient (SMC): it is the most frequently used similarity measure for binary variables, and it is discussed and used by Zubin (1938a) and by Sokal and Michener (1958). This coefficient, which matches the weights of the converging 1-1 and 0-0 answers, has its expression given by:

s_{pq} = \frac{a + d}{a + b + c + d}    (11.13)

• Jaccard index: even though it was first proposed by Gilbert (1884), it received this name because it was discussed and used in two extremely important papers developed by Jaccard (1901, 1908). This measure, also known as the Jaccard similarity coefficient, does not take the frequency of the pair of answers 0-0 into consideration, which is considered irrelevant. However, it is possible to come across a situation in which all the variables are equal to 0 for two observations, that is, there is only frequency in cell d of Table 11.8. In this case, software packages such as Stata present the Jaccard index equal to 1, which makes sense from a similarity standpoint. Its expression is given by:

s_{pq} = \frac{a}{a + b + c}    (11.14)

• Dice similarity coefficient (DSC): although it is only known by this name, it was suggested and discussed by Czekanowski (1932), Dice (1945), and Sørensen (1948). It is similar to the Jaccard index; however, it doubles the weight over the frequency of converging type 1-1 answer pairs. Just as in that case, software such as Stata present the Dice coefficient equal to 1 for the cases in which all the variables are equal to 0 for two observations, thus avoiding any uncertainty in the calculation. Its expression is given by:

s_{pq} = \frac{2a}{2a + b + c}    (11.15)

• Anti-Dice similarity coefficient: it was initially proposed by Sokal and Sneath (1963) and Anderberg (1973); the name anti-Dice comes from the fact that this coefficient doubles the weight over the frequencies of the divergent answer pairs (1-0 and 0-1), that is, it doubles the weight over the answer divergences. Just as the Jaccard and the Dice coefficients, the anti-Dice coefficient also ignores the frequency of 0-0 answer pairs. Its expression is given by:

s_{pq} = \frac{a}{a + 2(b + c)}    (11.16)

• Russell and Rao similarity coefficient: it is also widely used and it only favors the similarities of 1-1 answers in the calculation of its coefficient. It was proposed by Russell and Rao (1940), and its expression is given by:

s_{pq} = \frac{a}{a + b + c + d}    (11.17)

• Ochiai similarity coefficient: even though it is known by this name, it was initially proposed by Driver and Kroeber (1932) and, later on, it was used by Ochiai (1957). This coefficient is undefined when one or both observations being studied present all the variable values equal to 0. However, if both vectors present all the values equal to 0, software such as Stata present the Ochiai coefficient equal to 1. If this happens for only one of the two vectors, the Ochiai coefficient is considered equal to 0. Its expression is given by:

s_{pq} = \frac{a}{\sqrt{(a + b)(a + c)}}    (11.18)

• Yule similarity coefficient: proposed by Yule (1900) and used by Yule and Kendall (1950), this similarity coefficient for binary variables offers as an answer a coefficient that varies from -1 to 1. As we can see through its expression, the coefficient is undefined if one or both vectors compared present all the values equal to 0 or 1. Software such as Stata generate the Yule coefficient equal to 1, if b = c = 0 (a total convergence of answers), and equal to -1, if a = d = 0 (a total divergence of answers). Its expression is given by:

s_{pq} = \frac{ad - bc}{ad + bc}    (11.19)

• Rogers and Tanimoto similarity coefficient: this coefficient, which doubles the weight of the discrepant answers 0-1 and 1-0 in relation to the weight of the combinations of converging type 1-1 and 0-0 answers, was initially proposed by Rogers and Tanimoto (1960). Its expression, which becomes equal to the anti-Dice coefficient when the frequency of 0-0 answers is equal to 0 (d = 0), is given by:

s_{pq} = \frac{a + d}{a + d + 2(b + c)}    (11.20)


• Sneath and Sokal similarity coefficient: different from the Rogers and Tanimoto coefficient, this coefficient, proposed by Sneath and Sokal (1962), doubles the weight of converging type 1-1 and 0-0 answers in relation to the other answer combinations (1-0 and 0-1). Its expression, which becomes equal to the Dice coefficient when the frequency of type 0-0 answers is equal to 0 (d = 0), is given by:

s_{pq} = \frac{2(a + d)}{2(a + d) + b + c}    (11.21)

• Hamann similarity coefficient: Hamann (1961) proposed this similarity coefficient for binary variables aiming at having the frequencies of discrepant answers (1-0 and 0-1) subtracted from the total of converging answers (1-1 and 0-0). This coefficient, which varies from -1 (total answer divergence) to 1 (total answer convergence), is equal to two times the simple matching coefficient minus 1. Its expression is given by:

s_{pq} = \frac{(a + d) - (b + c)}{a + b + c + d}    (11.22)

As was discussed in Section 11.2.1.1 regarding the dissimilarity measures applied to metric variables, let's go back to the data presented in Table 11.7, aiming at calculating the different similarity measures between observations 1 and 2, which only have binary variables. In order to do that, from that table, we must construct the absolute frequency table of answers 0 and 1 for the observations mentioned (Table 11.9). Then, using Expressions (11.13)–(11.22), we are able to calculate the similarity measures themselves. Table 11.10 presents the calculations and the results of each coefficient. Analogous to what was discussed when the dissimilarity measures were calculated, we can clearly see that different similarity measures generate different results, which may cause, when defining the clustering method, the observations to be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis. Bear in mind that it does not make any sense to apply the Z-scores standardization procedure when calculating the similarity measures discussed in this section, since the variables used for the cluster analysis are binary.

At this moment, it is important to emphasize that, instead of using similarity measures to define the clusters whenever there are binary variables, it is very common to define clusters from the coordinates of each observation, which can be generated when elaborating simple or multiple correspondence analyses, for instance. This is an exploratory technique applied solely to datasets that have qualitative variables, aiming at creating perceptual maps, which are constructed based on the frequency of the categories of each one of the variables in the analysis (Fávero and Belfiore, 2017). After defining the coefficient that will be used, based on the research objectives, on the underlying theory, and on his experience and intuition, the researcher must move on to the definition of the agglomeration schedule. The main cluster analysis schedules will be studied in the following section.


TABLE 11.9 Absolute Frequencies of Answers 0 and 1 for Observations 1 and 2

                    Observation 1
Observation 2       1    0    Total
1                   3    2    5
0                   1    1    2
Total               4    3    7


TABLE 11.10 Similarity Measures Between Observations 1 and 2

Simple matching:      s_{12} = (3 + 1)/7 = 0.571
Jaccard:              s_{12} = 3/6 = 0.500
Dice:                 s_{12} = (2 · 3)/(2 · 3 + 2 + 1) = 0.667
Anti-Dice:            s_{12} = 3/(3 + 2 · (2 + 1)) = 0.333
Russell and Rao:      s_{12} = 3/7 = 0.429
Ochiai:               s_{12} = 3/\sqrt{(3 + 2)(3 + 1)} = 0.671
Yule:                 s_{12} = (3 · 1 - 2 · 1)/(3 · 1 + 2 · 1) = 0.200
Rogers and Tanimoto:  s_{12} = (3 + 1)/(3 + 1 + 2 · (2 + 1)) = 0.400
Sneath and Sokal:     s_{12} = 2 · (3 + 1)/(2 · (3 + 1) + 2 + 1) = 0.727
Hamann:               s_{12} = ((3 + 1) - (2 + 1))/7 = 0.143
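All the coefficients in Table 11.10 follow directly from the counts a, b, c, and d of Table 11.9. The sketch below (plain Python/NumPy; the dictionary of formulas is our own construction, not a library function) recomputes a few of them from the binary vectors of Table 11.7:

```python
import numpy as np

x1 = np.array([0, 0, 1, 1, 0, 1, 1])   # observation 1, Table 11.7
x2 = np.array([0, 1, 1, 1, 1, 0, 1])   # observation 2, Table 11.7

# Counts of Table 11.9: a = 1-1 pairs, b and c = divergent pairs, d = 0-0 pairs
a = int(np.sum((x1 == 1) & (x2 == 1)))   # 3
b = int(np.sum((x1 == 0) & (x2 == 1)))   # 2
c = int(np.sum((x1 == 1) & (x2 == 0)))   # 1
d = int(np.sum((x1 == 0) & (x2 == 0)))   # 1

coefficients = {
    "Simple matching (11.13)": (a + d) / (a + b + c + d),
    "Jaccard (11.14)":         a / (a + b + c),
    "Dice (11.15)":            2 * a / (2 * a + b + c),
    "Russell and Rao (11.17)": a / (a + b + c + d),
    "Hamann (11.22)":          ((a + d) - (b + c)) / (a + b + c + d),
}
for name, value in coefficients.items():
    print(f"{name}: {value:.3f}")   # matches Table 11.10
```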

11.2.2 Agglomeration Schedules in Cluster Analysis

As discussed by Vicini and Souza (2005) and Johnson and Wichern (2007), in cluster analysis, choosing the clustering method, also known as the agglomeration schedule, is as important as defining the distance (or similarity) measure, and this decision must also be made based on what researchers intend to do in terms of their research objectives.

Basically, agglomeration schedules can be classified into two types: hierarchical and nonhierarchical. While the former are characterized by favoring a hierarchical structure (step by step) when forming clusters, nonhierarchical schedules use algorithms to maximize the homogeneity within each cluster, without going through a hierarchical process. Hierarchical agglomeration schedules can be agglomerative or partitioning, depending on how the process starts. If all the observations are initially considered separate and, from their distances (or similarities), groups are formed until we reach a final stage with only one cluster, the process is known as agglomerative. Among all hierarchical agglomeration schedules, the most commonly used are those that have the following linkage methods: nearest-neighbor or single-linkage, furthest-neighbor or complete-linkage, and between-groups or average-linkage. On the other hand, if all the observations are initially considered grouped and, stage after stage, smaller groups are formed by the separation of each observation, until these subdivisions generate individual groups (that is, totally separated observations), then we have a partitioning process. Conversely, nonhierarchical agglomeration schedules, among which the most popular is the k-means procedure, refer to processes in which clustering centers are defined and the observations are allocated based on their proximity to them. Different from hierarchical schedules, in which the researcher can study the several possibilities for allocating observations and even define the ideal number of clusters based on each one of the clustering stages, a nonhierarchical agglomeration schedule requires that we previously stipulate the number of clusters from which the clustering centers will be defined and the observations allocated. That is why we recommend generating a hierarchical agglomeration schedule before constructing a nonhierarchical one, when there is no reasonable estimate of the number of clusters that can be formed from the observations in the dataset and based on the variables in study. Fig. 11.9 shows the logic of agglomeration schedules in cluster analysis. We will study hierarchical agglomeration schedules in Section 11.2.2.1, and Section 11.2.2.2 will be used to discuss the nonhierarchical k-means agglomeration schedule.
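The workflow recommended in the previous paragraph — run a hierarchical schedule first and let its suggested number of clusters feed the k-means procedure — can be sketched as follows (SciPy and scikit-learn are assumed; the data are invented and the choice of three clusters is merely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Hypothetical standardized dataset: 10 observations, 2 metric variables
rng = np.random.default_rng(0)
Z_scores = rng.normal(size=(10, 2))

# Step 1: hierarchical agglomeration schedule (single linkage, Euclidean distance)
Z = linkage(Z_scores, method="single", metric="euclidean")
hier_labels = fcluster(Z, t=3, criterion="maxclust")   # a 3-cluster reading of the dendrogram

# Step 2: the chosen number of clusters becomes the input for the nonhierarchical (k-means) schedule
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z_scores)

print(hier_labels)
print(km.labels_)
```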

11.2.2.1 Hierarchical Agglomeration Schedules

In this section, we will discuss the main hierarchical agglomeration schedules, in which larger and larger clusters are formed at each clustering stage because new observations or groups are added to them, due to a certain criterion (linkage method) and based on the distance measure chosen. In Section 11.2.2.1.1, the main concepts of these schedules will be presented and, in Section 11.2.2.1.2, a practical example will be presented and solved algebraically.

11.2.2.1.1 Notation

There are three main linkage methods in hierarchical agglomeration schedules, as shown in Fig. 11.9: the nearest-neighbor or single-linkage, the furthest-neighbor or complete-linkage, and the between-groups or average-linkage. Table 11.11 illustrates the distance to be considered in each clustering stage, based on the linkage method chosen.


FIG. 11.9 Agglomeration schedules in cluster analysis: hierarchical schedules (agglomerative or partitioning, with the nearest-neighbor/single-linkage, furthest-neighbor/complete-linkage, or between-groups/average-linkage methods) and nonhierarchical schedules (k-means).

TABLE 11.11 Distance to be Considered Based on the Linkage Method

Linkage method                                      Distance (dissimilarity)
Single (nearest-neighbor or single-linkage)         d23
Complete (furthest-neighbor or complete-linkage)    d15
Average (between-groups or average-linkage)         (d13 + d14 + d15 + d23 + d24 + d25)/6

(The distances refer to the two clusters illustrated in the original table: one formed by observations 1 and 2, and the other by observations 3, 4, and 5.)

The single-linkage method favors the shortest distances (thus, the nomenclature nearest neighbor) so that new clusters can be formed at each clustering stage through the incorporation of observations or groups. Therefore, applying it is advisable in cases in which the observations are relatively far apart, that is, different, and we would like to form clusters considering a minimum of homogeneity. On the other hand, its analysis may be hampered when there are observations or clusters just a little farther apart from each other, as shown in Fig. 11.10. The complete-linkage method, on the other hand, goes in the opposite direction, that is, it favors the greatest distances between the observations or groups so that new clusters can be formed (hence, the name furthest neighbor) and, in this regard, using it is advisable in cases in which there is no considerable distance between the observations, and the researcher needs to identify the heterogeneities between them. Finally, in the average-linkage method, two groups merge based on the average distance between all the pairs of observations that are in these groups (hence, the name average linkage). Accordingly, even though there are changes in the calculation of the distance measures between the clusters, the average-linkage method ends up preserving the order of the observations in each group, offered by the single-linkage method, in case there is a considerable distance between the observations. The same happens with the sorting solution provided by the complete-linkage method, if the observations are very close to each other.


FIG. 11.10 Single-linkage method—Hampered analysis when there are observations or clusters just a little further apart.

Johnson and Wichern (2007) proposed a logical sequence of steps in order to facilitate the understanding of a cluster analysis elaborated through a certain hierarchical agglomerative method:

1. If n is the number of observations in a dataset, we must start the agglomeration schedule with exactly n individual groups (stage 0), such that we will initially have a distances (or similarities) matrix D0 formed by the distances between each pair of observations.
2. In the first stage, we must choose the smallest distance among all of those that form matrix D0, that is, the one that connects the two most similar observations. At this exact moment, we will no longer have n individual groups; we will have (n - 1) groups, one of which is formed by two observations.
3. In the following clustering stage, we must repeat the previous stage. However, we now have to take into consideration the distance between each pair of observations and between the first group already formed and each one of the other observations, based on the linkage method adopted. In other words, after the first clustering stage we will have matrix D1, with dimensions (n - 1) × (n - 1), in which one of the rows will be represented by the first grouped pair of observations. Consequently, in the second stage, a new group will be formed by the grouping of two new observations or by adding a certain observation to the group formed in the first stage.
4. The previous process must be repeated (n - 1) times, until there is only a single group formed by all the observations. In other words, in stage (n - 2) we will have matrix D_{n-2}, which will only contain the distance between the last two remaining groups, before the final fusion.
5. Finally, from the clustering stages and the distances between the clusters formed, it is possible to develop a tree-shaped diagram that summarizes the clustering process and explains the allocation of each observation to each cluster. This diagram is known as a dendrogram or phenogram.

Therefore, the values that form the D matrices of each one of the stages will be a function of the distance measure chosen and of the linkage method adopted. In a certain clustering stage s, imagine that a researcher groups two previously formed clusters M and N, containing m and n observations, respectively, so that cluster MN is formed. Next, he intends to group MN with another cluster W, with w observations. Since we know that, in the hierarchical agglomerative methods, the decision on the next cluster will always be based on the smallest distance between each pair of observations or groups, the agglomeration schedule will be essential for the analysis of the distances that will form each matrix Ds. Using this logic and based on Table 11.11, let's discuss the criterion to calculate the distance between the clusters MN and W, inserted in matrix Ds, based on the linkage method:

• Nearest-neighbor or single-linkage method:

d_{(MN)W} = \min\{d_{MW}, d_{NW}\}    (11.23)

where d_{MW} and d_{NW} are the distances between the closest observations in clusters M and W and in clusters N and W, respectively.

• Furthest-neighbor or complete-linkage method:

d_{(MN)W} = \max\{d_{MW}, d_{NW}\}    (11.24)

where d_{MW} and d_{NW} are the distances between the farthest observations in clusters M and W and in clusters N and W, respectively.


• Between-groups or average-linkage method:

d_{(MN)W} = \frac{\sum_{p=1}^{m+n} \sum_{q=1}^{w} d_{pq}}{(m + n) \cdot w}    (11.25)

where d_{pq} represents the distance between any observation p in cluster MN and any observation q in cluster W, and m + n and w represent the number of observations in clusters MN and W, respectively.

In the following section, we will present a practical example that will be solved algebraically, and from which the concepts of hierarchical agglomerative methods will be established.

11.2.2.1.2 A Practical Example of Cluster Analysis With Hierarchical Agglomeration Schedules

Imagine that a college professor, who is very concerned about his students' capacity to learn the subject he teaches, Quantitative Methods, is interested in allocating them to groups with the highest homogeneity possible, based on the grades they obtained on the college entrance exams in subjects considered quantitative (Math, Physics, and Chemistry). In order to do that, the professor collected information on these grades, which vary from 0 to 10. In addition, since he will first carry out the cluster analysis in an algebraic way, he decided, for pedagogical purposes, to work with only five students. This dataset can be seen in Table 11.12.

TABLE 11.12 Example: Grades in Math, Physics, and Chemistry on the College Entrance Exam

Student (observation)    Grade in Mathematics (X1i)    Grade in Physics (X2i)    Grade in Chemistry (X3i)
Gabriela                 3.7                           2.7                       9.1
Luiz Felipe              7.8                           8.0                       1.5
Patricia                 8.9                           1.0                       2.7
Ovidio                   7.0                           1.0                       9.0
Leonor                   3.4                           2.0                       5.0

Based on the data obtained, the chart in Fig. 11.11 is constructed and, since the variables are metric, the dissimilarity measure known as the Euclidean distance will be used for the cluster analysis. Besides, since all the variables have values in the same unit of measure (grades from 0 to 10), in this case it will not be necessary to standardize them through Z-scores. In the following sections, hierarchical agglomeration schedules based on the Euclidean distance will be elaborated through the three linkage methods being studied.

11.2.2.1.2.1 Nearest-Neighbor or Single-Linkage Method

At this moment, from the data presented in Table 11.12, let us develop a cluster analysis through a hierarchical agglomeration schedule with the single-linkage method. First of all, we define matrix D0, formed by the Euclidean distances (dissimilarities) between each pair of observations.


FIG. 11.11 Three-dimensional chart with the relative position of the five students.
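Matrix D0 can be obtained directly from Table 11.12; the sketch below (SciPy assumed, as before) builds it, and its entries agree with the values used in the stages that follow (e.g., 3.713 between Gabriela and Ovidio):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
grades = np.array([[3.7, 2.7, 9.1],
                   [7.8, 8.0, 1.5],
                   [8.9, 1.0, 2.7],
                   [7.0, 1.0, 9.0],
                   [3.4, 2.0, 5.0]])

# Matrix D0: Euclidean distance between every pair of students (Table 11.12)
D0 = squareform(pdist(grades, metric="euclidean"))
for name, row in zip(students, np.round(D0, 3)):
    print(f"{name:12s}", row)
# The smallest off-diagonal entry, 3.713, links Gabriela and Ovidio (first clustering stage)
```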

It is important to mention that, at this initial moment, each observation is considered an individual cluster, that is, in stage 0 we have five clusters (the sample size). The smallest distance in matrix D0 (3.713, between Gabriela and Ovidio) is highlighted and, therefore, in the first stage, observations Gabriela and Ovidio are grouped and now form a new cluster. We must construct matrix D1 so that we can go to the next clustering stage, in which the distances between the cluster Gabriela-Ovidio and the other observations, which are still isolated, are calculated. Thus, by using the single-linkage method and based on Expression (11.23), we have:

d_{(Gabriela-Ovidio) Luiz Felipe} = \min\{10.132; 10.290\} = 10.132
d_{(Gabriela-Ovidio) Patricia} = \min\{8.420; 6.580\} = 6.580
d_{(Gabriela-Ovidio) Leonor} = \min\{4.170; 5.474\} = 4.170

These distances, together with those between the observations that remain isolated, form matrix D1.


In the same way, the smallest distance in matrix D1 (4.170) is highlighted. Therefore, in the second stage, observation Leonor is inserted into the already formed cluster Gabriela-Ovidio, while observations Luiz Felipe and Patricia still remain isolated. We must construct matrix D2 so that we can take the next step, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated. Analogously, we have:

d_{(Gabriela-Ovidio-Leonor) Luiz Felipe} = \min\{10.132; 8.223\} = 8.223
d_{(Gabriela-Ovidio-Leonor) Patricia} = \min\{6.580; 6.045\} = 6.045

These values, together with the distance between Luiz Felipe and Patricia, form matrix D2.

In the third clustering stage, observation Patricia is incorporated into the cluster Gabriela-Ovidio-Leonor, since the corresponding distance (6.045) is the smallest among all the ones presented in matrix D2. Therefore, we can write matrix D3, taking the following criterion into consideration:

d_{(Gabriela-Ovidio-Leonor-Patricia) Luiz Felipe} = \min\{8.223; 7.187\} = 7.187

Finally, in the fourth and last stage, all the observations are allocated to the same cluster, thus concluding the hierarchical process. Table 11.13 presents a summary of this agglomeration schedule, constructed by using the single-linkage method. Based on this agglomeration schedule, we can construct a tree-shaped diagram, known as a dendrogram or phenogram, whose main objective is to illustrate the step-by-step formation of the clusters and to facilitate the visualization of how each observation is allocated at each stage. The dendrogram can be seen in Fig. 11.12. Through Figs. 11.13 and 11.14, we are able to interpret the dendrogram constructed. First of all, we drew three lines (I, II, and III) that are orthogonal to the dendrogram lines, as shown in Fig. 11.13, which allow us to identify the number of clusters in each clustering stage, as well as the observations in each cluster. Therefore, line I "cuts" the dendrogram immediately after the first clustering stage and, at this moment, we can verify that there are four clusters (four intersections with the dendrogram's horizontal lines), one of them formed by observations Gabriela and Ovidio, and the others by the individual observations.


TABLE 11.13 Agglomeration Schedule Through the Single-Linkage Method

Stage    Cluster                            Grouped observation    Smallest Euclidean distance
1        Gabriela                           Ovidio                 3.713
2        Gabriela-Ovidio                    Leonor                 4.170
3        Gabriela-Ovidio-Leonor             Patricia               6.045
4        Gabriela-Ovidio-Leonor-Patricia    Luiz Felipe            7.187

FIG. 11.12 Dendrogram—Single-linkage method (Euclidean distance, 0 to 8, on the horizontal axis; observations Gabriela, Ovidio, Leonor, Patricia, and Luiz Felipe on the vertical axis).
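The whole single-linkage schedule of Table 11.13 and the dendrogram of Fig. 11.12 can be reproduced with SciPy; the sketch below assumes SciPy (and matplotlib, if the plot is to be displayed) is available:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

names = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
grades = np.array([[3.7, 2.7, 9.1],   # Gabriela
                   [7.8, 8.0, 1.5],   # Luiz Felipe
                   [8.9, 1.0, 2.7],   # Patricia
                   [7.0, 1.0, 9.0],   # Ovidio
                   [3.4, 2.0, 5.0]])  # Leonor

# Single-linkage agglomeration schedule on the Euclidean distances
Z = linkage(grades, method="single", metric="euclidean")
print(np.round(Z[:, 2], 3))   # fusion distances: 3.713, 4.170, 6.045, 7.187 (Table 11.13)

# Three-cluster solution suggested by the large distance leap
print(fcluster(Z, t=3, criterion="maxclust"))

# dendrogram(Z, labels=names)  # draws the tree of Fig. 11.12 (requires matplotlib to display)
```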

FIG. 11.13 Interpreting the dendrogram—Number of clusters and allocation of observations.

FIG. 11.14 Interpreting the dendrogram—Distance leaps.


On the other hand, line II intersects three horizontal lines of the dendrogram, which means that, after the second stage, in which observation Leonor was incorporated into the already formed cluster Gabriela-Ovidio, there are three clusters. Finally, line III is drawn immediately after the third stage, in which observation Patricia merges with the cluster Gabriela-Ovidio-Leonor. Since two intersections between this line and the dendrogram's horizontal lines are identified, we can see that observation Luiz Felipe remains isolated, while the others form a single cluster.

Besides providing a study of the number of clusters in each clustering stage and of the allocation of observations, a dendrogram also allows the researcher to analyze the magnitude of the distance leaps needed to establish the clusters. A leap of high magnitude, in comparison to the others, can indicate that a considerably different observation or cluster is being incorporated into already formed clusters, which supports the establishment of a solution regarding the number of clusters without the need for a next clustering stage. Although we know that setting an inflexible, mandatory number of clusters may hamper the analysis, at least having an idea of this number, given the distance measure used and the linkage method adopted, may help researchers better understand the characteristics of the observations that led to this fact. Moreover, since the number of clusters is important for constructing nonhierarchical agglomeration schedules, this piece of information (considered an output of the hierarchical schedule) may serve as input for the k-means procedure.

Fig. 11.14 presents three distance leaps (A, B, and C), regarding each one of the clustering stages, and, from their analysis, we can see that leap B, which represents the incorporation of observation Patricia into the already formed cluster Gabriela-Ovidio-Leonor, is the greatest of the three. Therefore, in case we intend to set the ideal number of clusters in this example, the researcher may choose the solution with three clusters (line II in Fig. 11.13), without the stage in which observation Patricia is incorporated, since it possibly has characteristics that are not so homogeneous and that make it unfeasible to include it in the previously formed cluster, given the large distance leap. Thus, in this case, we would have a cluster formed by Gabriela, Ovidio, and Leonor, another one formed only by Patricia, and a third one formed only by Luiz Felipe.

When using dissimilarity measures in clustering methods, a very useful criterion for identifying the number of clusters consists in identifying a considerable distance leap (whenever possible) and defining the number of clusters formed in the clustering stage immediately before the great leap, since very high leaps may incorporate observations with characteristics that are not so homogeneous. Furthermore, it is also important to mention that, if the distance leaps from one stage to another are small, due to the existence of variables with values that are too close across the observations, which can make it difficult to read the dendrogram, the researcher may use the squared Euclidean distance, so that the leaps become clearer and better explained, making it easier to identify the clusters in the dendrogram and providing better arguments for the decision-making process.
Software such as SPSS shows dendrograms with rescaled distance measures, in order to facilitate the interpretation of the allocation of each observation and the visualization of the large distance leaps. Fig. 11.15 illustrates how clusters can be established after the single-linkage method is elaborated. Next, we will develop the same example. However, now, let's use the complete- and average-linkage methods, so that we can compare the order of the observations and the distance leaps.

11.2.2.1.2.2 Furthest-Neighbor or Complete-Linkage Method

Matrix D0, shown here, is obviously the same, and the smallest Euclidian distance, the one highlighted, is between observations Gabriela and Ovidio, which become the first cluster. It is important to emphasize that the first cluster will always be the same, regardless of the linkage method used, since the first stage will always consider the smallest distance among all pairs of observations, which are still isolated.


FIG. 11.15 Suggestion of clusters formed after the single-linkage method (axes: mathematics, physics, and chemistry).

In the complete-linkage method, we must use Expression (11.24) to construct matrix D1, as follows:

d(Gabriela-Ovidio)Luiz Felipe = max{10.132; 10.290} = 10.290
d(Gabriela-Ovidio)Patricia = max{8.420; 6.580} = 8.420
d(Gabriela-Ovidio)Leonor = max{4.170; 5.474} = 5.474

Matrix D1 can be seen and, by analyzing it, we can see that observation Leonor will be incorporated into the cluster formed by Gabriela and Ovidio. Once again, the smallest value, among all the ones shown in matrix D1, is highlighted.

As verified when using the single-linkage method, here, observations Luiz Felipe and Patricia also remain isolated at this stage. The differences between the methods start arising now. Therefore, we will construct matrix D2 using the following criteria:

d(Gabriela-Ovidio-Leonor)Luiz Felipe = max{10.290; 8.223} = 10.290
d(Gabriela-Ovidio-Leonor)Patricia = max{8.420; 6.045} = 8.420


Matrix D2 can be written as follows:

In the third clustering stage, a new cluster is formed by the fusion of observations Patricia and Luiz Felipe, since the furthest-neighbor criterion adopted in the complete-linkage method makes the distance between these two observations become the smallest among all the ones calculated to construct matrix D2. Therefore, notice that at this stage differences related to the single-linkage method appear, in terms of the sorting and allocation of the observations to groups. Hence, to construct matrix D3, we must take the following criterion into consideration:

d(Gabriela-Ovidio-Leonor)(Luiz Felipe-Patricia) = max{10.290; 8.420} = 10.290

In the same way, in the fourth and last stage, all the observations are allocated to the same cluster, through the fusion of Gabriela-Ovidio-Leonor and Luiz Felipe-Patricia. Table 11.14 shows a summary of this agglomeration schedule, elaborated by using the complete-linkage method, and its dendrogram can be seen in Fig. 11.16. We can initially see that the sorting of the observations is different from what was observed in the dendrogram seen in Fig. 11.12. Analogous to what was carried out in the previous method, we chose to draw two vertical lines (I and II) over the largest distance leap, as shown in Fig. 11.17. Thus, if the researcher chooses to consider three clusters, the solution will be the same as the one achieved previously through the single-linkage method: one cluster formed by Gabriela, Ovidio, and Leonor, another one by Luiz Felipe, and a third one by Patricia (line I in Fig. 11.17). However, if he chooses to define two clusters (line II), the solution will be different since, in this case, the second cluster will be formed by Luiz Felipe and Patricia, while in the previous case it was formed only by Luiz Felipe, since observation Patricia was allocated to the first cluster.

TABLE 11.14 Agglomeration Schedule Through the Complete-Linkage Method

Stage | Cluster | Grouped Observation | Smallest Euclidian Distance
1 | Gabriela | Ovidio | 3.713
2 | Gabriela-Ovidio | Leonor | 5.474
3 | Luiz Felipe | Patricia | 7.187
4 | Gabriela-Ovidio-Leonor | Luiz Felipe-Patricia | 10.290
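As a hedged computational check (again a Python/SciPy sketch of ours, not the SPSS procedure used later in the chapter), the complete-linkage schedule of Table 11.14 and the two-cluster solution discussed above can be reproduced as follows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
X = np.array([[3.7, 2.7, 9.1], [7.8, 8.0, 1.5], [8.9, 1.0, 2.7],
              [7.0, 1.0, 9.0], [3.4, 2.0, 5.0]])

# Complete-linkage (furthest-neighbor) agglomeration schedule, Euclidian distance
Z = linkage(X, method="complete", metric="euclidean")
print(np.round(Z[:, 2], 3))   # expected: 3.713, 5.474, 7.187, 10.290, as in Table 11.14

# Cutting the dendrogram at two clusters (line II in Fig. 11.17)
print(dict(zip(students, fcluster(Z, t=2, criterion="maxclust"))))
# Gabriela, Ovidio, and Leonor share one label; Luiz Felipe and Patricia share the other
```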

FIG. 11.16 Dendrogram—Complete-linkage method (vertical axis: Euclidean distance).

FIG. 11.17 Interpreting the dendrogram—Clusters and distance leaps.

Similar to what was done in the previous method, Fig. 11.18 illustrates how the clusters can be established after the complete-linkage method is carried out. The decision on the clustering method can also be based on the application of the average-linkage method, in which two groups merge based on the average distance between all the pairs of observations that belong to these groups. Therefore, as we have already discussed, if the most suitable method is the single linkage, because there are observations considerably far apart from one another, the sorting and allocation of the observations will be maintained by the average-linkage method. On the other hand, the outputs of this method will show consistency with the solution achieved through the complete-linkage method, as regards the sorting and allocation of the observations, if the observations are very similar in the variables under study. Thus, it is advisable for the researcher to apply the three linkage methods when elaborating a cluster analysis through hierarchical agglomeration schedules. Therefore, let's move on to the average-linkage method.

11.2.2.1.2.3 Between-Groups or Average-Linkage Method

First of all, let's show the Euclidian distance matrix between each pair of observations (matrix D0), once again, highlighting the smallest distance between them.


FIG. 11.18 Suggestion of clusters formed after the complete-linkage method (axes: mathematics, physics, and chemistry).

By using Expression (11.25), we are able to calculate the terms of matrix D1, given that the first cluster Gabriela-Ovidio has already been formed. Thus, we have:

d(Gabriela-Ovidio)Luiz Felipe = (10.132 + 10.290)/2 = 10.211
d(Gabriela-Ovidio)Patricia = (8.420 + 6.580)/2 = 7.500
d(Gabriela-Ovidio)Leonor = (4.170 + 5.474)/2 = 4.822

Matrix D1 can be seen and, through it, we can see that observation Leonor is once again incorporated into the cluster formed by Gabriela and Ovidio. The smallest value among all the ones presented in matrix D1 has also been highlighted.


In order to construct matrix D2, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated, we must perform the following calculations:

d(Gabriela-Ovidio-Leonor)Luiz Felipe = (10.132 + 10.290 + 8.223)/3 = 9.548
d(Gabriela-Ovidio-Leonor)Patricia = (8.420 + 6.580 + 6.045)/3 = 7.015

Note that the distances used to calculate the dissimilarities to be inserted into matrix D2 are the original Euclidian distances between each pair of observations, that is, they come from matrix D0. Matrix D2 can be seen:

As verified when the single-linkage method was elaborated, here, observation Patricia is also incorporated into the cluster already formed by Gabriela, Ovidio, and Leonor, and observation Luiz Felipe remains isolated. Finally, matrix D3 can be constructed from the following calculation:

d(Gabriela-Ovidio-Leonor-Patricia)Luiz Felipe = (10.132 + 10.290 + 8.223 + 7.187)/4 = 8.958

Once again, in the fourth and last stage, all the observations are in the same cluster. Table 11.15 and Fig. 11.19 present a summary of this agglomeration schedule and the corresponding dendrogram, respectively, resulting from the average-linkage method.

TABLE 11.15 Agglomeration Schedule Through the Average-Linkage Method

Stage | Cluster | Grouped Observation | Smallest Euclidian Distance
1 | Gabriela | Ovidio | 3.713
2 | Gabriela-Ovidio | Leonor | 4.822
3 | Gabriela-Ovidio-Leonor | Patricia | 7.015
4 | Gabriela-Ovidio-Leonor-Patricia | Luiz Felipe | 8.958
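The averages above can also be verified computationally. The following sketch (a Python/SciPy illustration of ours) relies on the fact that SciPy's "average" method corresponds to the between-groups criterion used here, averaging all original pairwise distances between the two groups being merged, and reproduces the fusion distances of Table 11.15:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[3.7, 2.7, 9.1],   # Gabriela
              [7.8, 8.0, 1.5],   # Luiz Felipe
              [8.9, 1.0, 2.7],   # Patricia
              [7.0, 1.0, 9.0],   # Ovidio
              [3.4, 2.0, 5.0]])  # Leonor

# Between-groups (average-linkage) agglomeration schedule, Euclidian distance
Z = linkage(X, method="average", metric="euclidean")
print(np.round(Z[:, 2], 3))   # expected: 3.713, 4.822, 7.015, 8.958, as in Table 11.15
```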


FIG. 11.19 Dendrogram—Average-linkage method (vertical axis: Euclidean distance).

Despite having other distance values, we can see that Table 11.15 and Fig. 11.19 show the same sorting and the same allocation of observations in the clusters as those presented in Table 11.13 and in Fig. 11.12, respectively, obtained when the single-linkage method was elaborated. Hence, we can state that the observations are significantly different from one another in relation to the variables studied, a fact confirmed by the consistency of the answers obtained from the single- and average-linkage methods. If the observations were more similar, which has not been observed in the diagram seen in Fig. 11.11, the consistency of answers would occur between the complete- and average-linkage methods, as already discussed. Therefore, when possible, the initial construction of scatter plots may help researchers, even if in a preliminary way, choose the method to be adopted.

Hierarchical agglomeration schedules are very useful and offer us the possibility to analyze, in an exploratory way, the similarity between observations based on the behavior of certain variables. However, it is essential for researchers to understand that these methods are not conclusive by themselves and more than one answer may be obtained, depending on what is desired and on the data behavior. Besides, it is necessary for researchers to be aware of how sensitive these methods are to the presence of outliers. The existence of a very discrepant observation may cause other observations, not so similar to one another, to be allocated to the same cluster because they are extremely different from the observation considered an outlier. Hence, it is advisable to apply the hierarchical agglomeration schedules with the chosen linkage method several times and, in each application, to identify one or more observations considered outliers. This procedure will make the cluster analysis more reliable, since more and more homogeneous clusters may be formed. Researchers are free to characterize the most discrepant observation as the one that remains isolated after the penultimate clustering stage, that is, right before the total fusion. Nonetheless, many are the methods to define an outlier. Barnett and Lewis (1994), for instance, mention almost 1000 articles in the existing literature on outliers, and, for pedagogical purposes, in the Appendix of this chapter, we will discuss an efficient procedure in Stata for detecting outliers when a researcher is carrying out a multivariate data analysis.

It is also important to emphasize, as we have already discussed in this section, that different linkage methods, when elaborating hierarchical agglomeration schedules, must be applied to the same dataset, and the resulting dendrograms compared. This procedure will help researchers in their decision-making processes with regard to choosing the ideal number of clusters, and also with regard to sorting the observations and allocating each one of them to the different clusters formed. This will even allow researchers to make coherent decisions about the number of clusters that may be considered input in a possible nonhierarchical analysis. Last but not least, it is worth mentioning that the agglomeration schedules presented in this section (Tables 11.13, 11.14, and 11.15) provide increasing values of the clustering measures because a dissimilarity measure was used (Euclidian distance) as a comparison criterion between the observations.
If we had chosen Pearson’s correlation between the observations, a similarity measure also used for metric variables, as we discussed in Section 11.2.1.1, the values of the clustering measures in the agglomeration schedules would be decreasing. The latter is also true for cluster analyses in which similarity measures are used, as the ones studied in Section 11.2.1.2, to assess the behavior of observations based on binary variables. In the following section we will develop the same example, in an algebraic way, using the nonhierarchical k-means agglomeration schedule.

11.2.2.2 Nonhierarchical K-Means Agglomeration Schedule

Among all the nonhierarchical agglomeration schedules, the k-means procedure is the most often used by researchers in several fields of knowledge. Given that the number of clusters is previously defined by the researcher, this procedure can be


elaborated after the application of a hierarchical agglomeration schedule when we have no idea of the number of clusters that can be formed and, in this situation, the number of clusters obtained from the hierarchical schedule can serve as input for the nonhierarchical one.

11.2.2.2.1 Notation

As with the sequence developed in Section 11.2.2.1.1, we now present a logical sequence of steps, based on Johnson and Wichern (2007), in order to facilitate the understanding of the cluster analysis (k-means procedure):

1. We define the initial number of clusters and the respective centroids. The main objective is to divide the observations from the dataset into K clusters, such that those within each cluster are closer to one another than to any observation that belongs to a different cluster. For such, the observations need to be allocated arbitrarily to the K clusters, so that the respective centroids can be calculated.
2. We must choose a certain observation that is closer to the centroid of another cluster and reallocate it to that cluster. At this moment, the previous cluster has just lost that observation, and, therefore, the centroids of the cluster that receives it and of the cluster that loses it must be recalculated.
3. We must continue repeating the previous step until no observation is closer to the centroid of another cluster, that is, until it is no longer possible to reallocate any observation.

Centroid coordinate x̄ must be recalculated whenever a certain observation p is included in or excluded from the respective cluster, based on the following expressions:

x̄_new = (N · x̄ + x_p)/(N + 1), if observation p is inserted into the cluster under analysis   (11.26)
x̄_new = (N · x̄ − x_p)/(N − 1), if observation p is excluded from the cluster under analysis   (11.27)

where N and x̄ refer to the number of observations in the cluster and to its centroid coordinate before the reallocation of that observation, respectively. In addition, x_p refers to the coordinate of observation p, which changed clusters. For two variables (X1 and X2), Fig. 11.20 shows a hypothetical situation that represents the end of the k-means procedure, in which it is no longer possible to reallocate any observation because no observation is closer to the centroid of another cluster. The matrix with distances between observations does not need to be defined at each step, different from hierarchical agglomeration schedules, which reduces the requirements in terms of technological capabilities, allowing nonhierarchical agglomeration schedules to be applied to considerably larger datasets than those traditionally studied through hierarchical schedules.

FIG. 11.20 Hypothetical situation that represents the end of the K-means procedure.
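Expressions (11.26) and (11.27) translate directly into two small helper functions. The sketch below is our own Python illustration (the function names are hypothetical, not from any library); the test values anticipate the worked example that follows, in which removing Gabriela from the cluster Gabriela-Luiz Felipe must return Luiz Felipe's own coordinates.

```python
import numpy as np

def centroid_after_insert(centroid, n, point):
    # Expression (11.26): new centroid after observation p joins a cluster of size N
    return (n * np.asarray(centroid) + np.asarray(point)) / (n + 1)

def centroid_after_remove(centroid, n, point):
    # Expression (11.27): new centroid after observation p leaves a cluster of size N
    return (n * np.asarray(centroid) - np.asarray(point)) / (n - 1)

gabriela = [3.7, 2.7, 9.1]
print(centroid_after_remove([5.75, 5.35, 5.30], 2, gabriela))  # -> [7.8, 8.0, 1.5] (Luiz Felipe)
print(centroid_after_insert([7.95, 1.00, 5.85], 2, gabriela))  # -> approx. [6.53, 1.57, 6.93]
```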


In addition, bear in mind that the variables must be standardized before elaborating the k-means procedure, as in the hierarchical agglomeration schedules, if the respective values are not in the same unit of measure. Finally, after concluding this procedure, it is important for researchers to analyze if the values of a certain metric variable differ between the groups defined, that is, if the variability between the clusters is significantly higher than the internal variability of each cluster. The F-test of the one-way analysis of variance, or one-way ANOVA, allows us to develop this analysis, and its null and alternative hypotheses can be defined as follows:

H0: the variable under analysis has the same mean in all the groups formed.
H1: the variable under analysis has a different mean in at least one of the groups in relation to the others.

Therefore, a single F-test can be applied for each variable, aiming to assess the existence of at least one difference among all the comparison possibilities, and, in this regard, the main advantage of applying it is the fact that adjustments for the discrepant sizes of the groups do not need to be carried out to analyze several comparisons. On the other hand, rejecting the null hypothesis at a certain significance level does not allow the researcher to know which group(s) is(are) statistically different from the others in relation to the variable being analyzed. The F statistic corresponding to this test is given by the following expression:

F = variability between the groups / variability within the groups = [Σ(k=1..K) Nk · (X̄k − X̄)² / (K − 1)] / [Σ(k,i) (Xki − X̄k)² / (n − K)]   (11.28)

where Nk is the number of observations in the k-th cluster, X̄k is the mean of variable X in the same k-th cluster, X̄ is the general mean of variable X, and Xki is the value that variable X takes on for a certain observation i present in the k-th cluster. In addition, K represents the number of clusters to be compared, and n, the sample size. By using the F statistic, researchers will be able to identify the variables whose means most differ between the groups, that is, those that most contribute to the formation of at least one of the K clusters (highest F statistic), as well as those that do not contribute to the formation of the suggested number of clusters, at a certain significance level. In the following section, we will discuss a practical example that will be solved algebraically, and from which the concepts of the k-means procedure may be established.

11.2.2.2.2 A Practical Example of a Cluster Analysis With the Nonhierarchical K-Means Agglomeration Schedule

To solve the nonhierarchical k-means agglomeration schedule algebraically, let's use the data from our own example, which can be found in Table 11.12 and are shown in Table 11.16. Software packages such as SPSS use the Euclidian distance as the standard dissimilarity measure, which is why we will develop the algebraic procedures based on this measure. This criterion will even allow the results obtained to be compared to the ones found when elaborating the hierarchical agglomeration schedules in Section 11.2.2.1.2, as, in those situations, the Euclidian distance was also used. In the same way, it will not be necessary to standardize the variables through Z-scores, since all of them are in the same unit of measure (grades from 0 to 10). Otherwise, it would be crucial for researchers to standardize the variables before elaborating the k-means procedure.

TABLE 11.16 Example: Grades in Math, Physics, and Chemistry on the College Entrance Exams

Student (Observation) | Grade in Mathematics (X1i) | Grade in Physics (X2i) | Grade in Chemistry (X3i)
Gabriela | 3.7 | 2.7 | 9.1
Luiz Felipe | 7.8 | 8.0 | 1.5
Patricia | 8.9 | 1.0 | 2.7
Ovidio | 7.0 | 1.0 | 9.0
Leonor | 3.4 | 2.0 | 5.0


TABLE 11.17 Arbitrary Allocation of the Observations in K = 3 Clusters and Calculation of the Centroid Coordinates—Initial Step of the K-Means Procedure (Centroid Coordinates per Variable)

Cluster | Grade in Mathematics | Grade in Physics | Grade in Chemistry
Gabriela, Luiz Felipe | (3.7 + 7.8)/2 = 5.75 | (2.7 + 8.0)/2 = 5.35 | (9.1 + 1.5)/2 = 5.30
Patricia, Ovidio | (8.9 + 7.0)/2 = 7.95 | (1.0 + 1.0)/2 = 1.00 | (2.7 + 9.0)/2 = 5.85
Leonor | 3.40 | 2.00 | 5.00

Using the logical sequence presented in Section 11.2.2.2.1, we will develop the k-means procedure with K = 3 clusters. This number of clusters may have come from a decision made by the researcher based on a certain preliminary criterion, or it may have been chosen based on the outputs of the hierarchical agglomeration schedules. In our case, the decision was made based on the comparison of the dendrograms that had already been constructed, and on the similarity of the outputs obtained by the single- and average-linkage methods. Thus, we need to arbitrarily allocate the observations to three clusters, so that the respective centroids can be calculated. Therefore, we can establish that observations Gabriela and Luiz Felipe form the first cluster, Patricia and Ovidio, the second, and Leonor, the third. Table 11.17 shows the arbitrary formation of these preliminary clusters, as well as the calculation of the respective centroid coordinates, which makes the initial step of the k-means procedure algorithm possible. Based on these coordinates, we constructed the chart seen in Fig. 11.21, which shows the arbitrary allocation of each observation to its cluster and the respective centroids.

Based on the second step of the logical sequence presented in Section 11.2.2.2.1, we must choose a certain observation and calculate the distance between it and all the cluster centroids, assuming that it is or is not reallocated to each cluster. Selecting the first observation (Gabriela), for example, we can calculate the distances between it and the centroids of the clusters that have already been formed (Gabriela-Luiz Felipe, Patricia-Ovidio, and Leonor) and, after that, assume that it leaves its cluster (Gabriela-Luiz Felipe) and is inserted into one of the other two clusters, forming the cluster Gabriela-Patricia-Ovidio or Gabriela-Leonor. Thus, from Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Gabriela to one of the two clusters takes place, as shown in Table 11.18. Thus, from Tables 11.16, 11.17, and 11.18, we can calculate the following Euclidian distances:

Assumption that Gabriela is not reallocated:

dGabriela(Gabriela-Luiz Felipe) = √[(3.70 − 5.75)² + (2.70 − 5.35)² + (9.10 − 5.30)²] = 5.066
dGabriela(Patricia-Ovidio) = √[(3.70 − 7.95)² + (2.70 − 1.00)² + (9.10 − 5.85)²] = 5.614
dGabriela Leonor = √[(3.70 − 3.40)² + (2.70 − 2.00)² + (9.10 − 5.00)²] = 4.170

Assumption that Gabriela is reallocated:

dGabriela Luiz Felipe = √[(3.70 − 7.80)² + (2.70 − 8.00)² + (9.10 − 1.50)²] = 10.132
dGabriela(Gabriela-Patricia-Ovidio) = √[(3.70 − 6.53)² + (2.70 − 1.57)² + (9.10 − 6.93)²] = 3.743
dGabriela(Gabriela-Leonor) = √[(3.70 − 3.55)² + (2.70 − 2.35)² + (9.10 − 7.05)²] = 2.085
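The same reallocation test can be written compactly. The sketch below (an illustration of ours, assuming NumPy) compares Gabriela's distance to its current centroid with the distances to the other centroids recalculated by Expression (11.26) as if Gabriela had joined them; the coordinates come from Tables 11.16 and 11.17.

```python
import numpy as np

gabriela = np.array([3.7, 2.7, 9.1])

# Centroid of Gabriela's current cluster (Gabriela-Luiz Felipe, Table 11.17)
own_centroid = np.array([5.75, 5.35, 5.30])
print("stay:", round(float(np.linalg.norm(gabriela - own_centroid)), 3))    # 5.066

# Other clusters: centroid and number of observations before receiving Gabriela
others = {"Patricia-Ovidio": (np.array([7.95, 1.00, 5.85]), 2),
          "Leonor":          (np.array([3.40, 2.00, 5.00]), 1)}

for name, (centroid, n) in others.items():
    new_centroid = (n * centroid + gabriela) / (n + 1)         # Expression (11.26)
    d = float(np.linalg.norm(gabriela - new_centroid))
    print(f"move to {name}:", round(d, 3))                     # 3.743 and 2.085
# The smallest distance (2.085, cluster Gabriela-Leonor) triggers the reallocation of Gabriela.
```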


FIG. 11.21 Arbitrary allocation of the observations in K = 3 clusters and respective centroids—Initial step of the K-means procedure (axes: mathematics, physics, and chemistry).

TABLE 11.18 Simulating the Reallocation of Gabriela and Calculating the New Centroid Coordinates (Centroid Coordinates per Variable)

Cluster | Simulation | Grade in Mathematics | Grade in Physics | Grade in Chemistry
Luiz Felipe | Excluding Gabriela | (2 × 5.75 − 3.70)/(2 − 1) = 7.80 | (2 × 5.35 − 2.70)/(2 − 1) = 8.00 | (2 × 5.30 − 9.10)/(2 − 1) = 1.50
Gabriela, Patricia, Ovidio | Including Gabriela | (2 × 7.95 + 3.70)/(2 + 1) = 6.53 | (2 × 1.00 + 2.70)/(2 + 1) = 1.57 | (2 × 5.85 + 9.10)/(2 + 1) = 6.93
Gabriela, Leonor | Including Gabriela | (1 × 3.40 + 3.70)/(1 + 1) = 3.55 | (1 × 2.00 + 2.70)/(1 + 1) = 2.35 | (1 × 5.00 + 9.10)/(1 + 1) = 7.05

Obs.: Note that the values calculated for the Luiz Felipe centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.

Since Gabriela is the closest to the Gabriela-Leonor centroid (the shortest Euclidian distance), we must reallocate this observation to the cluster initially formed only by Leonor. So, the cluster in which observation Gabriela was at first (Gabriela-Luiz Felipe) has just lost it, and now Luiz Felipe has become an individual cluster. Therefore, the centroids of the cluster that receives it and the one that loses it must be recalculated. Table 11.19 shows the creation of the new clusters and the calculation of the respective centroid coordinates too.


TABLE 11.19 New Centroids With the Reallocation of Gabriela (Centroid Coordinates per Variable)

Cluster | Grade in Mathematics | Grade in Physics | Grade in Chemistry
Luiz Felipe | 7.80 | 8.00 | 1.50
Patricia, Ovidio | 7.95 | 1.00 | 5.85
Gabriela, Leonor | (3.7 + 3.4)/2 = 3.55 | (2.7 + 2.0)/2 = 2.35 | (9.1 + 5.0)/2 = 7.05

Based on these new coordinates, we can construct the chart shown in Fig. 11.22. Once again, let's repeat the previous step. At this moment, since observation Luiz Felipe is isolated, let's simulate the reallocation of the third observation (Patricia). We must calculate the distances between it and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, afterwards, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Patricia or Gabriela-Patricia-Leonor. Also based on Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Patricia to one of these two clusters happens, as shown in Table 11.20. Similar to what was carried out when simulating Gabriela's reallocation, based on Tables 11.16, 11.19, and 11.20, let's calculate the Euclidian distances between Patricia and each one of the centroids:

FIG. 11.22 New clusters and respective centroids—Reallocation of Gabriela (axes: mathematics, physics, and chemistry).


TABLE 11.20 Simulation of Patricia's Reallocation—Next Step of the K-Means Procedure Algorithm (Centroid Coordinates per Variable)

Cluster | Simulation | Grade in Mathematics | Grade in Physics | Grade in Chemistry
Luiz Felipe, Patricia | Including Patricia | (1 × 7.80 + 8.90)/(1 + 1) = 8.35 | (1 × 8.00 + 1.00)/(1 + 1) = 4.50 | (1 × 1.50 + 2.70)/(1 + 1) = 2.10
Ovidio | Excluding Patricia | (2 × 7.95 − 8.90)/(2 − 1) = 7.00 | (2 × 1.00 − 1.00)/(2 − 1) = 1.00 | (2 × 5.85 − 2.70)/(2 − 1) = 9.00
Gabriela, Patricia, Leonor | Including Patricia | (2 × 3.55 + 8.90)/(2 + 1) = 5.33 | (2 × 2.35 + 1.00)/(2 + 1) = 1.90 | (2 × 7.05 + 2.70)/(2 + 1) = 5.60

Obs.: Note that the values calculated for the Ovidio centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.

Assumption that Patricia is not reallocated:

dPatricia Luiz Felipe = √[(8.90 − 7.80)² + (1.00 − 8.00)² + (2.70 − 1.50)²] = 7.187
dPatricia(Patricia-Ovidio) = √[(8.90 − 7.95)² + (1.00 − 1.00)² + (2.70 − 5.85)²] = 3.290
dPatricia(Gabriela-Leonor) = √[(8.90 − 3.55)² + (1.00 − 2.35)² + (2.70 − 7.05)²] = 7.026

Assumption that Patricia is reallocated:

dPatricia(Luiz Felipe-Patricia) = √[(8.90 − 8.35)² + (1.00 − 4.50)² + (2.70 − 2.10)²] = 3.593
dPatricia Ovidio = √[(8.90 − 7.00)² + (1.00 − 1.00)² + (2.70 − 9.00)²] = 6.580
dPatricia(Gabriela-Patricia-Leonor) = √[(8.90 − 5.33)² + (1.00 − 1.90)² + (2.70 − 5.60)²] = 4.684

Bearing in mind that the Euclidian distance between Patricia and its own cluster Patricia-Ovidio is the shortest, we do not have to reallocate it to another cluster and, at this moment, we maintain the solution presented in Table 11.19 and in Fig. 11.22. Next, we will develop the same procedure, however, simulating the reallocation of the fourth observation (Ovidio). Analogously, we must calculate the distances between this observation and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, after that, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Ovidio or Gabriela-Ovidio-Leonor. Once again, by using Expressions (11.26) and (11.27), we can recalculate the new centroid coordinates, simulating that the reallocation of Ovidio to one of these two clusters takes place, as shown in Table 11.21.

TABLE 11.21 Simulating Ovidio's Reallocation—New Step of the K-Means Procedure Algorithm (Centroid Coordinates per Variable)

Cluster | Simulation | Grade in Mathematics | Grade in Physics | Grade in Chemistry
Luiz Felipe, Ovidio | Including Ovidio | (1 × 7.80 + 7.00)/(1 + 1) = 7.40 | (1 × 8.00 + 1.00)/(1 + 1) = 4.50 | (1 × 1.50 + 9.00)/(1 + 1) = 5.25
Patricia | Excluding Ovidio | (2 × 7.95 − 7.00)/(2 − 1) = 8.90 | (2 × 1.00 − 1.00)/(2 − 1) = 1.00 | (2 × 5.85 − 9.00)/(2 − 1) = 2.70
Gabriela, Ovidio, Leonor | Including Ovidio | (2 × 3.55 + 7.00)/(2 + 1) = 4.70 | (2 × 2.35 + 1.00)/(2 + 1) = 1.90 | (2 × 7.05 + 9.00)/(2 + 1) = 7.70

Obs.: Note that the values calculated for the Patricia centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.

Next, we can see the calculations of the Euclidian distances between Ovidio and each one of the centroids, defined from Tables 11.16, 11.19, and 11.21:

Assumption that Ovidio is not reallocated:

dOvidio Luiz Felipe = √[(7.00 − 7.80)² + (1.00 − 8.00)² + (9.00 − 1.50)²] = 10.290
dOvidio(Patricia-Ovidio) = √[(7.00 − 7.95)² + (1.00 − 1.00)² + (9.00 − 5.85)²] = 3.290
dOvidio(Gabriela-Leonor) = √[(7.00 − 3.55)² + (1.00 − 2.35)² + (9.00 − 7.05)²] = 4.187

Assumption that Ovidio is reallocated:

dOvidio(Luiz Felipe-Ovidio) = √[(7.00 − 7.40)² + (1.00 − 4.50)² + (9.00 − 5.25)²] = 5.145
dOvidio Patricia = √[(7.00 − 8.90)² + (1.00 − 1.00)² + (9.00 − 2.70)²] = 6.580
dOvidio(Gabriela-Ovidio-Leonor) = √[(7.00 − 4.70)² + (1.00 − 1.90)² + (9.00 − 7.70)²] = 2.791

In this case, since observation Ovidio is the closest to the centroid of Gabriela-Ovidio-Leonor (the shortest Euclidian distance), we must reallocate this observation to the cluster formed originally by Gabriela and Leonor. Therefore, observation Patricia becomes an individual cluster. Table 11.22 shows the centroid coordinates of clusters Luiz Felipe, Patricia, and Gabriela-Ovidio-Leonor. We will not carry out the procedure proposed for the fifth observation (Leonor), since it had already fused with observation Gabriela in the first step of the algorithm.

TABLE 11.22 New Centroids With Ovidio's Reallocation (Centroid Coordinates per Variable)

Cluster | Grade in Mathematics | Grade in Physics | Grade in Chemistry
Luiz Felipe | 7.80 | 8.00 | 1.50
Patricia | 8.90 | 1.00 | 2.70
Gabriela, Ovidio, Leonor | 4.70 | 1.90 | 7.70


FIG. 11.23 Solution of the K-means procedure (axes: mathematics, physics, and chemistry).

We can consider that the k-means procedure is concluded, since it is no longer possible to reallocate any observation due to closer proximity to another cluster's centroid. Fig. 11.23 shows the allocation of each observation to its cluster and the respective centroids. Note that the solution achieved is equal to the one reached through the single- (Fig. 11.15) and average-linkage methods, when we elaborated the hierarchical agglomeration schedules. As we have already discussed, we can see that the matrix with the distances between the observations does not need to be defined at each step of the k-means procedure algorithm, different from the hierarchical agglomeration schedules, which reduces the requirements in terms of technological capabilities, allowing nonhierarchical agglomeration schedules to be applied to datasets significantly larger than the ones traditionally studied through hierarchical schedules. Table 11.23 shows the Euclidian distances between each observation of the original dataset and the centroids of each one of the clusters formed.

TABLE 11.23 Euclidian Distances Between Observations and Cluster Centroids

Student (Observation) | Cluster Luiz Felipe | Cluster Patricia | Cluster Gabriela-Ovidio-Leonor
Gabriela | 10.132 | 8.420 | 1.897
Luiz Felipe | 0.000 | 7.187 | 9.234
Patricia | 7.187 | 0.000 | 6.592
Ovidio | 10.290 | 6.580 | 2.791
Leonor | 8.223 | 6.045 | 2.998
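For comparison, the whole procedure can be run with scikit-learn. One caveat: KMeans implements the batch (Lloyd) algorithm with several random initializations (n_init), not the observation-by-observation reallocation described here; on this small dataset, however, it is expected to converge to the same partition and the same centroids of Table 11.22. The sketch below is illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
X = np.array([[3.7, 2.7, 9.1], [7.8, 8.0, 1.5], [8.9, 1.0, 2.7],
              [7.0, 1.0, 9.0], [3.4, 2.0, 5.0]])

# K = 3 clusters; n_init repeats the procedure with several arbitrary starting allocations
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(dict(zip(students, km.labels_)))     # Gabriela, Ovidio, and Leonor share the same label
print(np.round(km.cluster_centers_, 2))    # expected centroids: (7.8, 8.0, 1.5), (8.9, 1.0, 2.7), (4.7, 1.9, 7.7)
```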


TABLE 11.24 Means per Cluster and General Mean of the Variable mathematics

Cluster 1 | Cluster 2 | Cluster 3
X_Luiz Felipe = 7.80 | X_Patricia = 8.90 | X_Gabriela = 3.70; X_Ovidio = 7.00; X_Leonor = 3.40
X̄_1 = 7.80 | X̄_2 = 8.90 | X̄_3 = 4.70

General mean: X̄ = 6.16

We would like to emphasize that this algorithm can be elaborated with another preliminary allocation of the observations to the clusters besides the one chosen in this example. Reapplying the k-means procedure with several arbitrary choices, given K clusters, allows the researcher to assess how stable the clustering procedure is, and to underpin the allocation of the observations to the groups in a consistent way. After concluding this procedure, it is essential to check, through the F-test of the one-way ANOVA, if the values of each one of the three variables considered in the analysis are statistically different between the three clusters. To make the calculation of the F statistics that correspond to this test easier, we constructed Tables 11.24, 11.25, and 11.26, which show the means per cluster and the general mean of the variables mathematics, physics, and chemistry, respectively. So, based on the values presented in these tables and by using Expression (11.28), we are able to calculate the variation between the groups and within them for each one of the variables, as well as the respective F statistics. Tables 11.27, 11.28, and 11.29 show these calculations.

TABLE 11.25 Means per Cluster and General Mean of the Variable physics

Cluster 1 | Cluster 2 | Cluster 3
X_Luiz Felipe = 8.00 | X_Patricia = 1.00 | X_Gabriela = 2.70; X_Ovidio = 1.00; X_Leonor = 2.00
X̄_1 = 8.00 | X̄_2 = 1.00 | X̄_3 = 1.90

General mean: X̄ = 2.94

TABLE 11.26 Means per Cluster and General Mean of the Variable chemistry

Cluster 1 | Cluster 2 | Cluster 3
X_Luiz Felipe = 1.50 | X_Patricia = 2.70 | X_Gabriela = 9.10; X_Ovidio = 9.00; X_Leonor = 5.00
X̄_1 = 1.50 | X̄_2 = 2.70 | X̄_3 = 7.70

General mean: X̄ = 5.46


TABLE 11.27 Variation and F Statistic for the Variable mathematics

Variability between the groups: [(7.80 − 6.16)² + (8.90 − 6.16)² + 3 × (4.70 − 6.16)²]/(3 − 1) = 8.296
Variability within the groups: [(3.70 − 4.70)² + (7.00 − 4.70)² + (3.40 − 4.70)²]/(5 − 3) = 3.990
F = 8.296/3.990 = 2.079

Note: The calculation of the variability within the groups only took cluster 3 into consideration, since the others show variability equal to 0, because they are formed by a single observation.

TABLE 11.28 Variation and F Statistic for the Variable physics

Variability between the groups: [(8.00 − 2.94)² + (1.00 − 2.94)² + 3 × (1.90 − 2.94)²]/(3 − 1) = 16.306
Variability within the groups: [(2.70 − 1.90)² + (1.00 − 1.90)² + (2.00 − 1.90)²]/(5 − 3) = 0.730
F = 16.306/0.730 = 22.337

Note: The same as the previous table.

TABLE 11.29 Variation and F Statistic for the Variable chemistry

Variability between the groups: [(1.50 − 5.46)² + (2.70 − 5.46)² + 3 × (7.70 − 5.46)²]/(3 − 1) = 19.176
Variability within the groups: [(9.10 − 7.70)² + (9.00 − 7.70)² + (5.00 − 7.70)²]/(5 − 3) = 5.470
F = 19.176/5.470 = 3.506

Note: The same as Table 11.27.
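These calculations can be verified with a few lines of Python. The one_way_anova function below is our own implementation of Expression (11.28), not a library routine, and scipy.stats.f supplies the right-tail significance level reported as sig. F:

```python
import numpy as np
from scipy import stats

def one_way_anova(groups):
    # Expression (11.28): between-groups variability over within-groups variability
    values = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = values.mean()
    k, n = len(groups), len(values)
    between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups) / (n - k)
    f_stat = between / within
    sig_f = stats.f.sf(f_stat, k - 1, n - k)   # right-tail p-value (sig. F)
    return round(between, 3), round(within, 3), round(f_stat, 3), round(sig_f, 3)

# Clusters found by the k-means procedure: {Luiz Felipe}, {Patricia}, {Gabriela, Ovidio, Leonor}
grades = {"mathematics": ([7.8], [8.9], [3.7, 7.0, 3.4]),
          "physics":     ([8.0], [1.0], [2.7, 1.0, 2.0]),
          "chemistry":   ([1.5], [2.7], [9.1, 9.0, 5.0])}

for variable, groups in grades.items():
    print(variable, one_way_anova(groups))
# Expected to reproduce Tables 11.27-11.30: F of about 2.079, 22.337, and 3.506; sig. F of about 0.325, 0.043, and 0.222
```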

Now, let's analyze whether or not the null hypothesis of the F-test can be rejected for each one of the variables. Since there are two degrees of freedom for the variability between the groups (K − 1 = 2) and two degrees of freedom for the variability within the groups (n − K = 2), by using Table A in the Appendix, we have Fc = 19.00 (critical F at a significance level of 0.05). Therefore, only for the variable physics can we reject the null hypothesis that all the groups formed have the same mean, since the calculated F is Fcal = 22.337 > Fc = F(2, 2, 5%) = 19.00. So, for this variable, there is at least one group that has a mean that is statistically different from the others. For the variables mathematics and chemistry, however, we cannot reject the test's null hypothesis at a significance level of 0.05.

Software packages such as SPSS and Stata do not offer Fc for the defined degrees of freedom and a certain significance level. However, they offer the Fcal significance level for these degrees of freedom. Thus, instead of analyzing if Fcal > Fc, we must verify if the Fcal significance level is less than 0.05 (5%). Therefore: if Sig. F (or Prob. F) < 0.05, there is at least one difference between the groups for the variable under analysis. The Fcal significance level can be obtained in Excel by using the command Formulas → Insert Function → FDIST, which will open a dialog box like the one shown in Fig. 11.24. As we can see in this figure, sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one difference between the groups for this variable at a significance level of 0.05. An inquisitive researcher will be able to carry out the same procedure for the variables mathematics and chemistry. In short, Table 11.30 presents the results of the one-way ANOVA, with the variation of each variable, the F statistics, and the respective significance levels.


FIG. 11.24 Obtaining the F significance level (command Insert Function).

TABLE 11.30 One-way Analysis of Variance (ANOVA)

Variable | Variability Between the Groups | Variability Within the Groups | F | Sig. F
mathematics | 8.296 | 3.990 | 2.079 | 0.325
physics | 16.306 | 0.730 | 22.337 | 0.043
chemistry | 19.176 | 5.470 | 3.506 | 0.222

The one-way ANOVA table still allows the researcher to identify the variables that most contribute to the formation of at least one of the clusters, because they have a mean that is statistically different from that of at least one of the groups in relation to the others, since they will have greater F statistic values. It is important to mention that F statistic values are very sensitive to the sample size and, in this case, the variables mathematics and chemistry ended up not having statistically different means among the three groups mainly because the sample is small (only five observations).

We would like to emphasize that this one-way ANOVA can also be carried out soon after the application of a certain hierarchical agglomeration schedule, since it only depends on the classification of the observations into groups. The researcher must be careful about only one thing when comparing the results obtained by a hierarchical schedule to the ones obtained by a nonhierarchical schedule: to use the same distance measure in both situations. Different allocations of the observations to the same number of clusters may happen if different distance measures are used in a hierarchical schedule and in a nonhierarchical schedule and, therefore, different values of the F statistics in both situations can be calculated.

In general, in case there are one or more variables that do not contribute to the formation of the suggested number of clusters, we recommend that the procedure be reapplied without it (or them). In these situations, the number of clusters may change and, if the researcher feels the need to underpin the initial input regarding the number of K clusters, he may even use a hierarchical agglomeration schedule without those variables before reapplying the k-means procedure, which will make the analysis cyclical. Moreover, the existence of outliers may generate considerably disperse clusters, and treating the dataset in order to identify extremely discrepant observations becomes an advisable procedure before elaborating nonhierarchical agglomeration schedules. In the Appendix of this chapter, an important procedure in Stata for detecting multivariate outliers will be presented.

As with hierarchical agglomeration schedules, the nonhierarchical k-means schedule cannot be used as an isolated technique to make a conclusive decision about the clustering of observations. The allocation of observations and the formation of clusters may be extremely sensitive to the data behavior, the sample size, and the criteria adopted by the researcher. The combination of the outputs found with the ones coming from other techniques can more powerfully underpin the choices made by the researcher, and provide higher transparency in the decision-making process.

At the end of the cluster analysis, since the clusters formed can be represented in the dataset by a new qualitative variable with labels connected to each observation (cluster 1, cluster 2, ..., cluster K), other exploratory multivariate techniques can be elaborated from it, as, for example, a correspondence analysis, so that, depending on the researcher's objectives, we can study a possible association between the clusters and the categories of other qualitative variables. This new qualitative variable, which represents the allocation of each observation, may also be used as an explanatory variable of a certain phenomenon in confirmatory multivariate models, as, for example, multiple regression models, as long


as it is transformed into dummy variables that represent the categories (clusters) of this new variable generated in the cluster analysis, as we will study in Chapter 13. On the other hand, such a procedure only makes sense when we intend to propose a diagnostic regarding the behavior of the dependent variable, without aiming at having forecasts. Since a new observation does not have its place in a certain cluster, obtaining its allocation is only possible when we include such observation into a new cluster analysis, in order to obtain a new qualitative variable and, consequently, new dummies. In addition, this new qualitative variable can also be considered dependent on a multinomial logistic regression model, allowing the researcher to evaluate the probabilities each observation has to belong to each one of the clusters formed, due to the behavior of other explanatory variables not initially considered in the cluster analysis. We would also like to highlight that this procedure depends on the research objectives and construct established, and has a diagnostic nature as regards the behavior of the variables in the sample for the existing observations, without a predictive purpose. Finally, if the clusters formed present substantiality in relation to the number of observations allocated, by using other variables, we may even apply specific confirmatory techniques for each cluster identified, so that, possibly, better adjusted models can be generated. Next, the same dataset will be used to run cluster analyses in SPSS and Stata. In Section 11.3, we will discuss the procedures for elaborating the techniques studied in SPSS and their results too. In Section 11.4, we will study the commands to perform the procedures in Stata, with the respective outputs.

11.3 CLUSTER ANALYSIS WITH HIERARCHICAL AND NONHIERARCHICAL AGGLOMERATION SCHEDULES IN SPSS

In this section, we will discuss, step by step, how to elaborate our example in the IBM SPSS Statistics Software. The main objective is to offer the researcher an opportunity to run cluster analyses with hierarchical and nonhierarchical schedules in this software package, given how easy it is to use and how didactical the operations are. Every time an output is shown, we will mention the respective result obtained when performing the algebraic solution in the previous sections, so that the researcher can compare them and increase his own knowledge on the topic. The use of the images in this section has been authorized by the International Business Machines Corporation©.

11.3.1 Elaborating Hierarchical Agglomeration Schedules in SPSS

Going back to the example presented in Section 11.2.2.1.2, remember that our professor is interested in grouping students in homogeneous clusters based on their grades (from 0 to 10) obtained on the college entrance exams, in Mathematics, Physics, and Chemistry. The data can be found in the file CollegeEntranceExams.sav and they are exactly the same as the ones presented in Table 11.12. In this section, we will carry out the cluster analysis using the Euclidian distance between the observations and only considering the single-linkage method. In order for a cluster analysis to be elaborated through a hierarchical method in SPSS, we must click on Analyze → Classify → Hierarchical Cluster.... A dialog box as the one shown in Fig. 11.25 will open. Next, we must insert the original variables from our example (mathematics, physics, and chemistry) into Variables and the variable that identifies the observations (student) in Label Cases by, as shown in Fig. 11.26. If the researcher does not have a variable that represents the name of the observations (in this case, a string), he may leave this last cell blank. First of all, in Statistics..., let’s choose the options Agglomeration schedule and Proximity matrix, which make the table with the agglomeration schedule be presented in the outputs, constructed based on the distance measure to be chosen and on the linkage method to be defined, and the matrix with the distances between each pair of observations, respectively. Let’s maintain the option None in Cluster Membership. Fig. 11.27 shows how this dialog box will be. When we click on Continue, we will go back to the main dialog box of the hierarchical cluster analysis. Next, we must click on Plots.... As seen in Fig. 11.28, let’s select the option Dendrogram and the option None in Icicle. In the same way, let’s click on Continue, so that we can go back to the main dialog box. In Method..., which is the most important dialog box of the hierarchical cluster analysis, we must choose the singlelinkage method, also known as the nearest neighbor. Thus, in Cluster Method, let’s select the option Nearest neighbor. An inquisitive researcher may see that the complete (Furthest neighbor) and average (Between-groups linkage) linkage methods, discussed in Section 11.2.2.1, are also available in this option. Besides, since the variables in the dataset are metric, we have to choose one of the dissimilarity measures found in Measure → Interval. In order to maintain the same logic used when solving our example algebraically, we will choose the Euclidian distance as a dissimilarity measure and, therefore, we must select the option Euclidean distance. We can also see that, in this option, we can find the other dissimilarity measures studied in Section 11.2.1.1, such as, the squared


Euclidean distance, Minkowski, Manhattan (Block, in SPSS), Chebyshev, and Pearson's correlation, which, even though it is a similarity measure, is also used for metric variables. Although we do not use similarity measures in this example because we are not working with binary variables, it is important to mention that some similarity measures can be selected if necessary. Hence, as discussed in Section 11.2.1.2, in Measure → Binary, we can select the simple matching, Jaccard, Dice, Anti-Dice (Sokal and Sneath 2, in SPSS), Russell and Rao, Ochiai, Yule (Yule's Q, in SPSS), Rogers and Tanimoto, Sneath and Sokal (Sokal and Sneath 1, in SPSS), and Hamann coefficients, among others.

FIG. 11.25 Dialog box for elaborating the cluster analysis with a hierarchical method in SPSS.

FIG. 11.26 Selecting the original variables.


FIG. 11.27 Selecting the options that generate the agglomeration schedule and the matrix with the distances between the pairs of observations.

FIG. 11.28 Selecting the option that generates the dendrogram.


FIG. 11.29 Dialog box for selecting the linkage method and the distance measure.

Still in the same dialog box, the researcher may request that the cluster analysis be elaborated from standardized variables. If necessary, for situations in which the original variables have different measurement units, the option Z scores in Transform Values → Standardize can be selected, which will make all the calculations be carried out on the standardized variables, which will then have means equal to 0 and standard deviations equal to 1. After these considerations, the dialog box in our example will become what can be seen in Fig. 11.29. Next, we can click on Continue and on OK. The first output (Fig. 11.30) shows dissimilarity matrix D0, formed by the Euclidian distances between each pair of observations. We can even see that the legend says, "This is a dissimilarity matrix." If this matrix were formed by similarity measures, resulting from calculations elaborated from binary variables, it would say, "This is a similarity matrix."

FIG. 11.30 Matrix with Euclidian distances (dissimilarity measures) between pairs of observations.


FIG. 11.31 Hierarchical agglomeration schedule—Single-linkage method and Euclidian distance.

Through this matrix, which is equal to the one whose values were calculated and presented in Section 11.2.2.1.2, we can verify that observations Gabriela and Ovidio are the most similar (the smallest Euclidian distance) in relation to the variables mathematics, physics, and chemistry (dGabriela Ovidio = 3.713). Therefore, in the hierarchical schedule shown in Fig. 11.31, the first clustering stage occurs exactly by joining these two students, with Coefficient (Euclidian distance) equal to 3.713. Note that the columns Cluster Combined Cluster 1 and Cluster 2 refer to the isolated observations, when they are still not incorporated into a certain cluster, or to clusters that have already been formed. Obviously, in the first clustering stage, the first cluster is formed by the fusion of two isolated observations.

Next, in the second stage, observation Leonor (5) is incorporated into the cluster previously formed by Gabriela (1) and Ovidio (4). With regard to the single-linkage method, we can see that the distance considered for the agglomeration of Leonor was the smallest between this observation and Gabriela or Ovidio, that is, the criterion adopted was:

d(Gabriela-Ovidio)Leonor = min{4.170; 5.474} = 4.170

We can also see that, while the columns Stage Cluster First Appears Cluster 1 and Cluster 2 indicate in which previous stage each corresponding observation was incorporated into a certain cluster, the column Next Stage shows in which future stage the respective cluster will receive a new observation or cluster, given that we are dealing with a clustering method. In the third stage, observation Patricia (3) is incorporated into the already formed cluster Gabriela-Ovidio-Leonor, respecting the following distance criterion:

d(Gabriela-Ovidio-Leonor)Patricia = min{8.420; 6.580; 6.045} = 6.045

And, finally, given that we have five observations, in the fourth and last stage, observation Luiz Felipe, which is still isolated (note that the last observation to be incorporated into a cluster corresponds to the last value equal to 0 in the column Stage Cluster First Appears Cluster 2), is incorporated into the cluster already formed by the other observations, concluding the agglomeration schedule. The distance considered at this stage is given by:

d(Gabriela-Ovidio-Leonor-Patricia)Luiz Felipe = min{10.132; 10.290; 8.223; 7.187} = 7.187

Based on how the observations are sorted in the agglomeration schedule and on the distances used as a clustering criterion, the dendrogram can be constructed, and it can be seen in Fig. 11.32. Note that the distance measures are rescaled to construct the dendrograms in SPSS, so that the interpretation of each observation's allocation to the clusters and, mainly, the visualization of the highest distance leaps can be made easier, as discussed in Section 11.2.2.1.2.1. The way the observations are sorted in the dendrogram corresponds to what was presented in the agglomeration schedule (Fig. 11.31) and, from the analysis shown in Fig. 11.32, it is possible to see that the greatest distance leap occurs when Patricia merges with the already formed cluster Gabriela-Ovidio-Leonor. This leap could have already been identified in the agglomeration schedule found in Fig. 11.31, since a large increase in distance occurs when we go from the second to the third stage, that is, when the Euclidian distance increases from 4.170 to 6.045 (44.96%), so that a new cluster can be formed by incorporating another observation.
Therefore, we can choose the configuration existing at the end of the second clustering stage, in which three clusters are formed. As discussed in Section 11.2.2.1.2.1, the criterion of identifying the number of clusters at the clustering stage immediately before a large leap is very useful and commonly used. Fig. 11.33 shows a vertical dashed line that “cuts” the dendrogram in the region where the highest leaps occur. Since this line intersects three branches of the dendrogram, we can identify three corresponding clusters, formed by Gabriela-Ovidio-Leonor, Patricia, and Luiz Felipe, respectively.

FIG. 11.32 Dendrogram—Single-linkage method and rescaled Euclidean distances in SPSS.

FIG. 11.33 Dendrogram with cluster identification—cluster Gabriela-Ovidio-Leonor and individual clusters Patricia and Luiz Felipe.


FIG. 11.34 Defining the number of clusters.

As discussed, it is common to find dendrograms that make it difficult to identify distance leaps, mainly because there are considerably similar observations in the dataset in relation to all the variables under analysis. In these situations, it is advisable to use the squared Euclidean distance and the complete-linkage method (furthest neighbor), a combination of criteria that is very popular for datasets with extremely homogeneous observations.

Having adopted the solution with three clusters, we can once again click on Analyze → Classify → Hierarchical Cluster... and, in Statistics..., select the option Single solution in Cluster Membership. In this option, we must insert the number 3 into Number of clusters, as shown in Fig. 11.34. When we click on Continue, we go back to the main dialog box of the cluster analysis. In Save..., let’s choose the option Single solution and, in the same way, insert the number 3 into Number of clusters, as shown in Fig. 11.35, so that the new variable corresponding to the allocation of observations to the clusters becomes available in the dataset. Next, we can click on Continue and on OK.

Although the other outputs generated are the same as before, it is important to notice that a new table of results is now presented, corresponding to the allocation of the observations to the clusters. Fig. 11.36 shows, for three clusters, that, while observations Gabriela, Ovidio, and Leonor form a single cluster, called 1, observations Luiz Felipe and Patricia form two individual clusters, called 2 and 3, respectively. Even though these names are numerical, it is important to highlight that they only represent the labels (categories) of a qualitative variable.

When elaborating the procedure described, we can see that a new variable, called CLU3_1 by SPSS, is generated in the dataset, as shown in Fig. 11.37. This new variable is automatically classified by the software as Nominal, that is, qualitative, as shown in Fig. 11.38, which can be obtained when we click on Variable View, in the lower left-hand corner of the SPSS screen. As we have already discussed, the variable CLU3_1 can be used in other exploratory techniques, such as correspondence analysis, or in confirmatory techniques. In the latter case, it can be inserted, for example, into the vector of explanatory variables (as long as it is transformed into dummies) of a multiple regression model, or used as the dependent variable of a certain multinomial logistic regression model, in which researchers intend to study the behavior of other variables, not inserted into the cluster analysis, concerning the probability of each observation being allocated to each one of the clusters formed. This decision, however, depends on the research objectives.

At this moment, the researcher may consider the cluster analysis with hierarchical agglomeration schedules concluded. Nevertheless, based on the new variable CLU3_1 and by using the one-way ANOVA, he may still study whether the values of a certain variable differ between the clusters formed, that is, whether the variability between the groups is significantly higher than the variability within each one of them. Although this analysis was not developed when the hierarchical schedules were solved algebraically (we chose to carry it out only after the k-means procedure in Section 11.2.2.2.2), we can show how it is applied at this point, since the observations have already been allocated to the groups.
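For reference, the F statistic reported for each variable in this one-way ANOVA is simply the ratio between the between-groups and the within-groups mean squares, following the decomposition of the variation discussed in Section 11.2.2.2.2. For a given variable $X$, with $K$ clusters, $n$ observations, and $n_k$ observations in cluster $k$, it can be written as:

$$F = \frac{\displaystyle\sum_{k=1}^{K} n_k\,(\bar{X}_k - \bar{X})^2 \,/\, (K-1)}{\displaystyle\sum_{k=1}^{K}\sum_{i \in k} (X_{ik} - \bar{X}_k)^2 \,/\, (n-K)}$$

where $\bar{X}_k$ is the mean of the variable in cluster $k$ and $\bar{X}$ is its overall mean. Higher values of F (lower sig. F) indicate variables that discriminate the clusters more strongly.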


FIG. 11.35 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Hierarchical procedure.

FIG. 11.36 Allocating the observations to the clusters.

FIG. 11.37 Dataset with the new variable CLU3_1—Allocation of each observation.


FIG. 11.38 Nominal (qualitative) classification of the variable CLU3_1.

In order to do that, let’s click on Analyze → Compare Means → One-Way ANOVA.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Dependent List and the variable CLU3_1 (Single Linkage) into Factor. The dialog box will look like the one shown in Fig. 11.39. In Options..., let’s choose the options Descriptive (in Statistics) and Means plot, as shown in Fig. 11.40. Next, we can click on Continue and on OK.

While Fig. 11.41 shows the descriptive statistics of the clusters per variable, similar to Tables 11.24, 11.25, and 11.26, Fig. 11.42 uses these values and shows the calculation of the variation between the groups (Between Groups) and within them (Within Groups), as well as the F statistic for each variable and the respective significance levels. We can see that these values correspond to the ones calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.30. From Fig. 11.42, we can see that sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one group that has a statistically different mean, when compared to the others, at a significance level of 0.05. However, the same cannot be said about the variables mathematics and chemistry.

Although we already have an idea, based on the outputs seen in Fig. 11.41, of which group has a statistically different mean for the variable physics, constructing the diagrams of means may facilitate the analysis of the differences between the variable means per cluster even more. The charts generated by SPSS (Figs. 11.43, 11.44, and 11.45) allow us to see these differences between the groups for each variable analyzed. Therefore, from the chart seen in Fig. 11.44, it is possible to see that group 2, formed only by observation Luiz Felipe, in fact has a mean different from the others in relation to the variable physics. Besides, even though the diagrams in Figs. 11.43 and 11.45 show mean differences of the variables mathematics and chemistry between the groups, these differences cannot be considered statistically significant at a significance level of 0.05, since we are dealing with a very small number of observations, and the F statistic values are very sensitive to the sample size. This graphical analysis becomes really useful when we are studying datasets with a larger number of observations and variables.

FIG. 11.39 Dialog box with the selection of the variables to run the one-way analysis of variance in SPSS.


FIG. 11.40 Selecting the options to carry out the one-way analysis of variance.

FIG. 11.41 Descriptive statistics of the clusters per variable.

FIG. 11.42 One-way analysis of variance—Between groups and within groups variation, F statistics, and significance levels per variable.


FIG. 11.43 Means of the variable mathematics in the three clusters.

FIG. 11.44 Means of the variable physics in the three clusters.

Finally, researchers can still complement their analysis by elaborating a procedure known as multidimensional scaling, since using the distance matrix may help them construct a chart that allows a two-dimensional visualization of the relative positions of the observations, regardless of the total number of variables. In order to do that, we must structure a new dataset formed exactly by the distance matrix. For the data in our example, we can open the file CollegeEntranceExamMatrix.sav, which contains the Euclidean distance matrix shown in Fig. 11.46. Note that both the columns and the rows of this new dataset refer to the observations in the original dataset (the distance matrix is square).
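As a brief conceptual note (the algebra of multidimensional scaling is not developed in this chapter), the procedure searches for coordinates $z_1, \dots, z_n$ on the plane whose pairwise distances reproduce the original dissimilarities $d_{ij}$ as faithfully as possible. One common way of quantifying the quality of this reproduction is a stress measure of the general form:

$$\text{Stress} = \sqrt{\frac{\sum_{i<j}\left(d_{ij} - \lVert z_i - z_j \rVert\right)^2}{\sum_{i<j} d_{ij}^2}}$$

and the algorithm chooses the configuration that minimizes this loss (procedures such as the one used in this section work with a variant based on squared distances, known as S-stress), so lower values indicate a more faithful two-dimensional projection.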


FIG. 11.45 Means of the variable chemistry in the three clusters.

FIG. 11.46 Dataset with the Euclidean distance matrix.

Let’s click on Analyze → Scale → Multidimensional Scaling (ALSCAL).... In the dialog box that will open, we must insert the variables that represent the observations into Variables; since the data already correspond to distances, nothing needs to be done regarding the field Distances. The dialog box will look like the one shown in Fig. 11.47. In Model..., let’s select the option Ratio in Level of Measurement (note that the option Euclidean distance in Scaling Model has already been selected) and, in Options..., the option Group plots in Display, as shown in Figs. 11.48 and 11.49, respectively. Next, we can click on Continue and on OK.

Fig. 11.50 shows the chart with the relative positions of the observations projected on a plane. This type of chart is really useful when researchers wish to prepare didactical presentations of clusters of observations (individuals, companies, municipalities, countries, among other examples) and to make the interpretation of the clusters easier, mainly when there is a relatively large number of variables in the dataset.


FIG. 11.47 Dialog box with the selection of the variables to run the multidimensional scaling in SPSS.

FIG. 11.48 Defining the nature of the variable that corresponds to the distance measure.


FIG. 11.49 Selecting the option for constructing the twodimensional chart.

FIG. 11.50 Two-dimensional chart with the projected relative positions of the observations (derived stimulus configuration, Euclidean distance model).


FIG. 11.51 Dialog box for elaborating the cluster analysis with the nonhierarchical K-means method in SPSS.

11.3.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in SPSS

Maintaining the same logic proposed in the chapter and using the same dataset, we will now develop a cluster analysis based on the nonhierarchical k-means agglomeration schedule. Thus, we must once again use the file CollegeEntranceExams.sav. In order to do that, we must click on Analyze → Classify → K-Means Cluster.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Variables, and the variable student into Label Cases by. The main difference between this initial dialog box and the one corresponding to the hierarchical procedure is that the number of clusters from which the k-means algorithm will be elaborated must be specified. In our example, let’s insert the number 3 into Number of Clusters. Fig. 11.51 shows how the dialog box will look.

We can see that we inserted the original variables into the field Variables. This procedure is acceptable since, for our example, the values are in the same unit of measure. If this were not the case, before elaborating the k-means procedure, researchers would have to standardize them through the Z-scores procedure: in Analyze → Descriptive Statistics → Descriptives..., insert the original variables into Variables and select the option Save standardized values as variables. When we click on OK, the new standardized variables become part of the dataset.

Going back to the initial screen of the k-means procedure, we will click on Save.... In the dialog box that will open, we must select the option Cluster membership, as shown in Fig. 11.52. When we click on Continue, we go back to the previous dialog box. In Options..., let’s select the options Initial cluster centers, ANOVA table, and Cluster information for each case, in Statistics, as shown in Fig. 11.53. Next, we can click on Continue and on OK. It is important to mention that SPSS already uses the Euclidean distance as the standard dissimilarity measure when elaborating the k-means procedure.
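For reference, the Z-scores standardization mentioned above simply converts each original variable into a variable with zero mean and unit standard deviation:

$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{s_j}$$

where $\bar{X}_j$ and $s_j$ are the sample mean and standard deviation of variable $j$, so that variables measured in different units become comparable before the distances are computed.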


FIG. 11.52 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Nonhierarchical procedure.

FIG. 11.53 Selecting the options to perform the K-means procedure.

The first two outputs generated refer to the initial step and to the iteration of the k-means algorithm. The centroid coordinates are presented in the initial step and, through them, we can notice that SPSS considers the three initial clusters to be formed by the first three observations in the dataset. Although this decision is different from the one we used in Section 11.2.2.2.2, the choice is purely arbitrary and, as we will see later, it will not impact the formation of the clusters in the final step of the k-means algorithm at all. While Fig. 11.54 shows the values of the original variables for observations Gabriela, Luiz Felipe, and Patricia (as shown in Table 11.16) as the centroid coordinates of the three groups, in Fig. 11.55 we can see, after the first iteration of the algorithm, that the change in the centroid coordinate of the first cluster is 1.897, which corresponds exactly to the Euclidean distance between observation Gabriela and the centroid of the cluster Gabriela-Ovidio-Leonor (as shown in Table 11.23).

FIG. 11.54 First step of the K-means algorithm—Centroids of the three groups as observation coordinates.


FIG. 11.55 First iteration of the K-means algorithm and change in the centroid coordinates.

FIG. 11.56 Final stage of the K-means algorithm—Allocation of the observations and distances to the respective cluster centroids.

In the footnotes of Fig. 11.55, it is also possible to see the measure 7.187, which corresponds to the Euclidean distance between observations Luiz Felipe and Patricia, which remain isolated after the iteration.

The next three figures refer to the final stage of the k-means algorithm. While the output Cluster Membership (Fig. 11.56) shows the allocation of each observation to each one of the three clusters, as well as the Euclidean distances between each observation and the centroid of the respective group, the output Distances between Final Cluster Centers (Fig. 11.58) shows the Euclidean distances between the group centroids. These two outputs have values that were calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.23. Moreover, the output Final Cluster Centers (Fig. 11.57) shows the centroid coordinates of the groups after the final stage of this nonhierarchical procedure, which correspond to the values already calculated and presented in Table 11.22.
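These observation-to-centroid distances are ordinary Euclidean distances computed on the centroid coordinates. For an observation $i$ and a cluster $C$ with centroid $\bar{x}_C = (\bar{x}_{C1}, \dots, \bar{x}_{Ck})$:

$$d_{i,C} = \sqrt{\sum_{j=1}^{k} \left(x_{ij} - \bar{x}_{Cj}\right)^2}$$

which is how, for instance, the value 1.897 mentioned earlier can be reproduced from Gabriela's grades and the centroid coordinates of the cluster Gabriela-Ovidio-Leonor.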

FIG. 11.57 Final stage of the K-Means algorithm—Cluster centroid coordinates.


FIG. 11.58 Final stage of the K-means algorithm—Distances between the cluster centroids.

FIG. 11.59 One-way analysis of variance in the K-means procedure—Variation between groups and within groups, F statistics, and significance levels per variable.

The ANOVA output (Fig. 11.59) is analogous to the one presented in Table 11.30 in Section 11.2.2.2.2 and in Fig. 11.42 in Section 11.3.1, and, through it, we can see that only the variable physics has a statistically different mean in at least one of the groups formed, when compared to the others, at a significance level of 0.05. As we have previously discussed, if one or more variables are not contributing to the formation of the suggested number of clusters, we recommend that the algorithm be reapplied without these variables. The researcher can even use a hierarchical procedure without the aforementioned variables before reapplying the k-means procedure. For the data in our example, however, the analysis would become univariate due to the exclusion of the variables mathematics and chemistry, which demonstrates the risk researchers take when working with extremely small datasets in cluster analysis.

It is important to mention that the ANOVA output must only be used to study which variables most contribute to the formation of the specified number of clusters, since the clusters are formed precisely so that the differences between observations allocated to different groups are maximized. Thus, as explained in this output's footnotes, we cannot use the F statistic to formally test whether the groups formed are equal or not. For this reason, it is common to find the term pseudo F for this statistic in the existing literature. Finally, Fig. 11.60 shows the number of observations in each one of the clusters.

Similar to the hierarchical procedure, we can see that a new variable (obviously qualitative) is generated in the dataset after the elaboration of the k-means procedure, which is called QCL_1 by SPSS, as shown in Fig. 11.61. This variable ended up being identical to the variable CLU3_1 (Fig. 11.37) in this example. Nonetheless, this does not always happen when there is a larger number of observations or when different dissimilarity measures are used in the hierarchical and nonhierarchical procedures. Having presented the procedures for the application of the cluster analysis in SPSS, let's discuss this technique in Stata.

FIG. 11.60 Number of observations in each cluster.


FIG. 11.61 Dataset with the new variable QCL_1—Allocation of each observation.

11.4 CLUSTER ANALYSIS WITH HIERARCHICAL AND NONHIERARCHICAL AGGLOMERATION SCHEDULES IN STATA

Now, we will present the step-by-step procedures for elaborating our example in Stata Statistical Software®. In this section, our main objective is not to discuss the concepts related to cluster analysis once again, but to give the researcher an opportunity to apply the technique by using the commands this software has to offer. At each presentation of an output, we will mention the respective result obtained when performing the algebraic solution and also when using SPSS. The use of the images in this section has been authorized by StataCorp LP©.

11.4.1 Elaborating Hierarchical Agglomeration Schedules in Stata

Therefore, let’s begin with the dataset constructed by the professor, which contains the grades in Mathematics, Physics, and Chemistry obtained by five students in the college entrance exams. The dataset can be found in the file CollegeEntranceExams.dta and is exactly the same as the one presented in Table 11.12 in Section 11.2.2.1.2. Initially, we can type the command desc, which allows us to analyze the dataset characteristics, such as the number of observations, the number of variables, and the description of each one of them. Fig. 11.62 shows this first output in Stata. As discussed previously, since the original variables have values in the same unit of measure, it is not necessary to standardize them by using the Z-scores procedure in this example. However, if the researcher wishes to do so, he may obtain the standardized variables through the following commands:

egen zmathematics = std(mathematics)
egen zphysics = std(physics)
egen zchemistry = std(chemistry)

FIG. 11.62 Description of the CollegeEntranceExams.dta dataset.


TABLE 11.31 Terms in Stata Corresponding to the Measures for Metric Variables

Measure for Metric Variables | Term in Stata
Euclidean | L2
Squared Euclidean | L2squared
Manhattan | L1
Chebyshev | Linf
Canberra | Canberra
Pearson's correlation | corr

First of all, let’s obtain the matrix with the distances between the pairs of observations. In general, the sequence of commands for obtaining distance or similarity matrices in Stata is:

matrix dissimilarity D = variables*, option*
matrix list D

where the term variables* must be replaced by the list of variables to be considered in the analysis, and the term option* must be replaced by the term corresponding to the distance or similarity measure that the researcher wishes to use. While Table 11.31 shows the terms in Stata that correspond to each one of the measures for metric variables studied in Section 11.2.1.1, Table 11.32 shows the terms related to the measures used for the binary variables studied in Section 11.2.1.2. Therefore, since we wish to obtain the Euclidean distance matrix between the pairs of observations, in order to maintain the criterion used in the chapter, we must type the following sequence of commands:

matrix dissimilarity D = mathematics physics chemistry, L2
matrix list D

The output generated, which can be seen in Fig. 11.63, is in accordance with what was presented in matrix D0 in Section 11.2.2.1.2.1, and also in Fig. 11.30, when we elaborated the technique in SPSS (Section 11.3.1). Next, we will carry out the cluster analysis itself. The general command used to run a cluster analysis through a hierarchical schedule in Stata is:

cluster method* variables*, measure(option*)

where, besides the substitution of the terms variables* and option*, as discussed previously, we must substitute the term method* for the linkage method chosen by the researcher. Table 11.33 shows the terms in Stata related to the methods discussed in Section 11.2.2.1.

TABLE 11.32 Terms in Stata Corresponding to the Measures for Binary Variables

Measure for Binary Variables | Term in Stata
Simple matching | matching
Jaccard | Jaccard
Dice | Dice
AntiDice | antiDice
Russell and Rao | Russell
Ochiai | Ochiai
Yule | Yule
Rogers and Tanimoto | Rogers
Sneath and Sokal | Sneath
Hamann | Hamann


FIG. 11.63 Euclidean distance matrix between pairs of observations.

TABLE 11.33 Terms in Stata That Correspond to the Linkage Methods in Hierarchical Agglomeration Schedules

Linkage Method | Term in Stata
Single | singlelinkage
Complete | completelinkage
Average | averagelinkage

Therefore, for the data in our example and following the criterion adopted throughout this chapter (single-linkage method with the Euclidean distance, term L2), we must type the following command:

cluster singlelinkage mathematics physics chemistry, measure(L2)

After that, we can type the command cluster list, which presents, in a summarized way, the criteria used by the researcher to develop the hierarchical cluster analysis. Fig. 11.64 shows the outputs generated. From Fig. 11.64 and by analyzing the dataset, we can verify that three new variables are generated, regarding the identification of each observation (_clus_1_id), the order of the observations when creating the clusters (_clus_1_ord), and the Euclidean distances used to incorporate a new observation in each one of the clustering stages (_clus_1_hgt). Fig. 11.65 shows how the dataset looks after this cluster analysis is elaborated.

It is important to mention that Stata presents the values of the variable _clus_1_hgt shifted by one row, which can make the analysis a little confusing. While distance 3.713 refers to the merger between observations Ovidio and Gabriela (first stage of the agglomeration schedule), distance 7.187 corresponds to the fusion between Luiz Felipe and the cluster already formed by all the other observations (last stage of the agglomeration schedule), as already shown in Table 11.13 and in Fig. 11.31. Thus, in order to correct this offset and to obtain the real behavior of the distances in each new clustering stage, researchers can type the following sequence of commands, whose output can be seen in Fig. 11.66.

FIG. 11.64 Elaboration of the hierarchical cluster analysis and summary of the criteria used.


FIG. 11.65 Dataset with the new variables.

FIG. 11.66 Stages of the agglomeration schedule and respective Euclidean distances.

Note that a new variable (dist) is generated, which corresponds to the correction of the offset found in the variable _clus_1_hgt (term [_n-1]) and presents the Euclidean distance at which a new cluster is established in each stage of the agglomeration schedule:

gen dist = _clus_1_hgt[_n-1]
replace dist=0 if dist==.
sort dist
list student dist

Having carried out this phase, we can ask Stata to construct the dendrogram by typing one of the two equivalent commands: cluster dendrogram, labels(student) horizontal

or

cluster tree, labels(student) horizontal

The diagram generated can be seen in Fig. 11.67. We can see that the dendrogram constructed by Stata, in terms of Euclidean distances, is equal to the one shown in Fig. 11.12, constructed when the modeling was solved algebraically. However, it differs from the one constructed by SPSS (Fig. 11.32), which considers rescaled measures. Regardless of this fact, we will adopt three clusters as a possible solution, one of them formed by Leonor, Ovidio, and Gabriela; another by Patricia; and the third by Luiz Felipe, since the criteria discussed for large distance leaps coherently lead us toward this decision. In order to generate a new variable corresponding to the allocation of the observations to the three clusters, we must type the following sequence of commands. Note that we have named this new variable cluster. The output seen in Fig. 11.68 shows the allocation of the observations to the groups and is equivalent to the one shown in Fig. 11.36 (SPSS).

cluster generate cluster = groups(3), name(_clus_1)
sort _clus_1_id
list student cluster


FIG. 11.67 Dendrogram—Single-linkage method and Euclidean distances in Stata.


FIG. 11.68 Allocating the observations to the clusters.

Finally, by using the one-way analysis of variance (ANOVA), we will study whether the values of a certain variable differ between the groups represented by the categories of the new qualitative variable cluster generated in the dataset, that is, whether the variation between the groups is significantly higher than the variation within each one of them, following the logic proposed in Section 11.3.1. In order to do that, let’s type the following commands, in which the three metric variables (mathematics, physics, and chemistry) are individually related to the variable cluster:

oneway mathematics cluster, tabulate
oneway physics cluster, tabulate
oneway chemistry cluster, tabulate

The results of the ANOVA for the three variables are in Fig. 11.69. The outputs in this figure, which show the results of the variation Between groups and Within groups, the F statistics, and the respective significance levels (Prob. F, or Prob > F in Stata) for each variable, are equal to the ones calculated algebraically and presented in Table 11.30 (Section 11.2.2.2.2) and also in Fig. 11.42, when this procedure was elaborated in SPSS (Section 11.3.1). Therefore, as we have already discussed, we can see that, while for the variable physics there is at least one cluster that has a statistically different mean, when compared to the others, at a significance level of 0.05 (Prob. F = 0.0429 < 0.05), the variables mathematics and chemistry do not have statistically different means between the three groups formed, for this sample and at the significance level set. It is important to bear in mind that, if more than one variable has Prob. F less than 0.05, the one considered the most discriminant of the groups is the one with the highest F statistic (that is, the lowest significance level Prob. F).


FIG. 11.69 ANOVA for the variables mathematics, physics, and chemistry.

Even if it is possible to conclude the hierarchical analysis at this moment, the researcher has the option to run a multidimensional scaling, in order to see the projections of the relative positions of the observations in a two-dimensional chart, similar to what was done in Section 11.3.1. In order to do that, he may type the following command:

mds mathematics physics chemistry, id(student) method(modern) measure(L2) loss(sstress) config nolog

The outputs generated can be found in Figs. 11.70 and 11.71, and the chart in the latter corresponds to the one shown in Fig. 11.50.


FIG. 11.70 Elaborating the multidimensional scaling in Stata.

FIG. 11.71 Chart with projections of the relative positions of the observations.

Having presented the commands to carry out the cluster analysis with hierarchical agglomeration schedules in Stata, let’s move on to the elaboration of the nonhierarchical k-means agglomeration schedule in the same software package.

11.4.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in Stata

In order to apply the k-means procedure to the data in the file CollegeEntranceExams.dta, we must type the following command:

cluster kmeans mathematics physics chemistry, k(3) name(kmeans) measure(L2) start(firstk)

where the term k(3) is the input for the algorithm to be elaborated with three clusters. Besides, we define that a new variable with the allocation of the observations to the three groups will be generated in the dataset with the name kmeans (term name(kmeans)), and that the distance measure used will be the Euclidean distance (term L2). Moreover, the term firstk specifies that the coordinates of the first k observations of the sample will be used as the initial centroids of the k clusters (in our case, k = 3), which corresponds exactly to the criterion adopted by SPSS, as discussed in Section 11.3.2.
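As a side note (not part of the original example), the start() option also accepts other seeding rules; a minimal sketch, assuming the option start(krandom()) with an arbitrary seed and a hypothetical variable name kmeans2, would be:

* hypothetical alternative: k randomly chosen observations as initial centroids
cluster kmeans mathematics physics chemistry, k(3) name(kmeans2) measure(L2) start(krandom(1234))
* cross-tabulate the two allocations to compare them
tabulate kmeans kmeans2

For this small dataset, the final partition should coincide with the one obtained with start(firstk), although the numeric labels assigned to the clusters may be permuted.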


FIG. 11.72 Elaborating the nonhierarchical K-means procedure and a summary of the criteria used.

Next, we can type the command cluster list kmeans so that, in a summarized way, the criteria adopted for elaborating the k-means procedure can be presented. The outputs in Fig. 11.72 show what is generated by Stata after we type the last two commands. The next two commands generate, in the software outputs, two tables that refer to the number of observations in each one of the three clusters formed and to the allocation of each observation to these groups, respectively:

table kmeans
list student kmeans

Fig. 11.73 shows these outputs. These results correspond to the ones found when the k-means procedure was solved algebraically in Section 11.2.2.2.2 (Fig. 11.23) and to the ones obtained when this procedure was elaborated in SPSS in Section 11.3.2 (Figs. 11.60 and 11.61). Even though we could develop a one-way analysis of variance for the original variables in the dataset from the new qualitative variable generated (kmeans), we chose not to carry out this procedure here, since we have already done so for the variable cluster generated in Section 11.4.1 after the hierarchical procedure, which is exactly the same as the variable kmeans in this case. On the other hand, for pedagogical purposes, we present the command that generates the means of each variable in the three clusters, so that they can be compared:

tabstat mathematics physics chemistry, by(kmeans)

The output generated can be seen in Fig. 11.74 and is equivalent to the one presented in Tables 11.24, 11.25, and 11.26. Finally, the researcher can also construct a chart that shows the interrelationships between the variables, two at a time. This chart, known as a matrix chart, can give the researcher a better understanding of how the variables relate to one another.

FIG. 11.73 Number of observations in each cluster and allocation of observations.


FIG. 11.74 Means per cluster and general means of the variables mathematics, physics, and chemistry.

FIG. 11.75 Interrelationship between the variables and relative position of the observations in each cluster—matrix chart.

It can even suggest the relative position of the observations in each cluster within these interrelationships. To construct the chart shown in Fig. 11.75, we must type the following command:

graph matrix mathematics physics chemistry, mlabel(kmeans)

Obviously, this chart could also have been constructed in the previous section; however, we chose to present it only at the end of the elaboration of the k-means procedure in Stata. By analyzing it, it is possible to verify, among other things, that considering only the variables mathematics and chemistry is not enough to keep observations Luiz Felipe and Patricia (clusters 2 and 3, respectively) apart. It is necessary to consider the variable physics so that these two students can, in fact, be allocated to different clusters when three clusters are formed. Although this may seem obvious when analyzing such a small dataset directly, the chart becomes extremely useful for larger samples with a considerable number of variables, a fact that would multiply these interrelationships.


11.5 FINAL REMARKS

Many are the situations in which researchers may wish to group observations (individuals, companies, municipalities, countries, political parties, plant species, among other examples) from certain metric or even binary variables. Creating homogeneous clusters, reducing data structurally, and verifying the validity of previously established constructs are some of the main reasons that make researchers choose to work with cluster analysis. This set of techniques allows decision-making mechanisms to be better structured and justified based on the behavior of, and the interdependence relationship between, the observations of a certain dataset. Since the variable that represents the clusters formed is qualitative, the outputs of the cluster analysis can serve as inputs in other multivariate techniques, both exploratory and confirmatory.

It is strongly advisable for researchers to justify, clearly and transparently, the measure they chose and that will serve as the basis for the observations to be considered more or less similar, as well as the reasons that make them define nonhierarchical or hierarchical agglomeration schedules and, in this last case, the linkage methods.

In the last few years, the evolution of technological capabilities and the development of new software, with extremely improved resources, have caused new and better cluster analysis techniques to arise, techniques that use increasingly sophisticated algorithms and that support the decision-making process in several fields of knowledge, always with the main goal of grouping observations based on certain criteria. In this chapter, however, we tried to offer a general overview of the main, and also most popular, cluster analysis methods. Lastly, we would like to highlight that the application of this important set of techniques must always be done by using the chosen software correctly and sensibly, based on the underlying theory and on researchers' experience and intuition.

11.6 EXERCISES

1) The scholarship department of a certain college wishes to investigate the interdependence relationship between the students entering university in a certain school year, based only on two metric variables (age, in years, and average family income, in US$). The main objective is to propose a still unknown number of new scholarship programs aimed at homogeneous groups of students. In order to do that, data on 100 new students were collected and a dataset was constructed, which can be found in the files Scholarship.sav and Scholarship.dta, with the following variables:

Variable | Description
student | A string variable that identifies all freshmen in the college
age | Student's age (years)
income | Average family income (US$)

We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the complete-linkage method (furthest neighbor) and the squared Euclidean distance (a minimal command sketch is provided after this exercise). Present only the final part of the agglomeration schedule table and discuss the results. Reminder: Since the variables have different units of measure, it is necessary to apply the Z-scores standardization procedure to prepare the cluster analysis correctly.
b) Based on the table found in the previous item and on the dendrogram, we ask you: how many clusters of students will be formed?
c) Is it possible to identify one or more very discrepant students, in comparison to the others, regarding the two variables under analysis?
d) If the answer to the previous item is “yes,” once again run the hierarchical cluster analysis with the same criteria, however, now, without the student(s) considered discrepant. From the analysis of the new results, can new clusters be identified?
e) Discuss how the presence of outliers can hamper the interpretation of results in a cluster analysis.
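For readers who wish to follow item (a) in Stata, a minimal command sketch, assuming the file Scholarship.dta and the variable names listed above, could be:

use Scholarship.dta, clear
* standardize the variables, since age and income are in different units of measure
egen zage = std(age)
egen zincome = std(income)
* complete-linkage method with the squared Euclidean distance
cluster completelinkage zage zincome, measure(L2squared)
cluster dendrogram, labels(student) horizontal

The variables created by the cluster command (containing the agglomeration distances) and the dendrogram generated by these commands provide the basis for items (b) to (e).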

2) The marketing department of a retail company wants to study possible discrepancies in their 18 stores spread throughout three regional centers and distributed all over the country. In order to maintain and preserve its brand’s image and identity, top management would like to know if their stores are homogeneous in terms of customers’


perception of attributes such as services, variety of goods, and organization. Thus, first, a survey with samples of clients was carried out in each store, so that data regarding these attributes could be collected. The attributes were defined based on the average score obtained (0 to 100) in each store. Next, a dataset was constructed, and it contains the following variables:

Variable | Description
store | A string variable that varies from 01 to 18 and identifies the commercial establishment (store)
regional | A string variable that identifies each regional center (Regional 1 to Regional 3)
services | Customers' average evaluation of the services rendered (score from 0 to 100)
assortment | Customers' average evaluation of the variety of goods (score from 0 to 100)
organization | Customers' average evaluation of the organization (score from 0 to 100)

These data can be found in the files Retail Regional Center.sav and Retail Regional Center.dta. We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method and the Euclidean distance. Present the matrix with the distances between each pair of observations. Reminder: Since the variables are in the same unit, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: how many clusters of stores will be formed?
d) Run a multidimensional scaling and, after that, present and discuss the two-dimensional chart generated with the relative positions of the stores.
e) Run a cluster analysis by using the k-means procedure, with the number of clusters suggested in item (c), and interpret the one-way analysis of variance for each variable considered in the study, at a significance level of 0.05. Which variable contributes the most to the creation of at least one of the clusters formed, that is, which of them is the most discriminant of the groups?
f) Is there any correspondence between the allocations of the observations to the groups obtained by the hierarchical and nonhierarchical methods?
g) Is it possible to identify an association between any regional center and a certain discrepant group of stores, which could justify the management's concern regarding the brand's image and identity? If the answer is “yes,” once again run the hierarchical cluster analysis with the same criteria, however, now, without this discrepant group of stores. By analyzing the new results, is it possible to see the differences between the other stores more clearly?

3) A financial market analyst has decided to carry out a survey with CEOs and directors of large companies that operate in the health, education, and transport industries, in order to investigate how these companies' operations are carried out and the mechanisms that guide their decision-making processes. In order to do that, he structured a questionnaire with 50 questions, whose answers are only dichotomous, or binary. After applying the questionnaire, he got answers from 35 companies and, from then on, structured a dataset, present in the files Binary Survey.sav and Binary Survey.dta. In a generic way, the variables are:

Variable | Description
q1 to q50 | A list of 50 dummy variables that refer to the way the operations and the decision-making processes are carried out in these companies
sector | Company sector

The analyst's main goal is to verify whether companies in the same sector show similarities in relation to the way their operations and decision-making processes are carried out, at least from their own managers' perspective. In order to do that, after collecting the data, a cluster analysis can be elaborated. We would like you to:
a) Based on the hierarchical cluster analysis elaborated with the average-linkage method (between groups) and the simple matching similarity measure for binary variables, analyze the agglomeration schedule generated.
b) Interpret the dendrogram.


c) Check whether there is any correspondence between the allocations of the companies to the clusters and the respective sectors, or, in other words, whether the companies in the same sector show similarities regarding the way their operations and decision-making processes are carried out.

4) A greengrocer has decided to monitor the sales of his products for 16 weeks (4 months). The main objective is to verify if the sales behavior of three of his main products (bananas, oranges, and apples) is recurrent after a certain period, due to weekly wholesale price fluctuations, which are passed on to customers and may impact sales. These data can be found in the files Veggiefruit.sav and Veggiefruit.dta, which have the following variables:

Variable | Description
week | A string variable that varies from 1 to 16 and identifies the week in which the sales were monitored
week_month | A string variable that varies from 1 to 4 and identifies the week within each one of the months
banana | Number of bananas sold that week (un.)
orange | Number of oranges sold that week (un.)
apple | Number of apples sold that week (un.)

We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method (nearest neighbor) and Pearson's correlation measure. Present the matrix of similarity measures (Pearson's correlation) between each row in the dataset (weekly periods). Reminder: Since the variables are in the same unit of measure, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: is there any indication that the joint sales behavior of bananas, oranges, and apples is recurrent in certain weeks?

APPENDIX A.1 Detecting Multivariate Outliers

Even though detecting outliers is extremely important when applying practically every multivariate data analysis technique, we chose to add this Appendix to the present chapter because cluster analysis represents the first set of multivariate exploratory techniques being studied, whose outputs can be used as inputs in several other techniques, and because very discrepant observations may significantly interfere in the creation of clusters. Barnett and Lewis (1994) mention almost 1,000 articles in the existing literature on outliers. However, we chose to present a very effective, computationally simple, and fast algorithm for detecting multivariate outliers, bearing in mind that the identification of outliers for each variable individually, that is, in a univariate way, has already been studied in Chapter 3.

A) Brief Presentation of the Blocked Adaptive Computationally Efficient Outlier Nominators Algorithm

Billor et al. (2000), in an extremely important work, present an interesting algorithm whose purpose is to detect multivariate outliers, called Blocked Adaptive Computationally Efficient Outlier Nominators, or simply BACON. This algorithm, explained in a very clear and didactical way by Weber (2010), is defined based on a few steps, briefly described as follows:

1. From a dataset with n observations and k variables X (j = 1, ..., k), in which each observation is identified by i (i = 1, ..., n), the distance between an observation i, which has a vector of dimension k, $x_i = (x_{i1}, x_{i2}, \dots, x_{ik})$, and the general mean of all sample values (group G), which also has a vector of dimension k, $\bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k)$, is given by the following expression, known as the Mahalanobis distance:

$$d_{iG} = \sqrt{(x_i - \bar{x})'\, S^{-1}\, (x_i - \bar{x})} \qquad (11.29)$$


where S represents the covariance matrix of the n observations. Therefore, the first step of the algorithm consists of identifying m (m > k) homogeneous observations (initial group M) that have the smallest Mahalanobis distances in relation to the entire sample. It is important to mention that the dissimilarity measure known as the Mahalanobis distance, not discussed in this chapter, is adopted by the aforementioned authors because it is not susceptible to variables that are in different measurement units.

2. Next, the Mahalanobis distances between each observation i and the mean of the m observations that belong to group M, which also has a vector of dimension k, $\bar{x}_M = (\bar{x}_{M1}, \bar{x}_{M2}, \dots, \bar{x}_{Mk})$, are calculated, such that:

$$d_{iM} = \sqrt{(x_i - \bar{x}_M)'\, S_M^{-1}\, (x_i - \bar{x}_M)} \qquad (11.30)$$

where $S_M$ represents the covariance matrix of the m observations.

3. All the observations with Mahalanobis distances less than a certain threshold are added to group M. This threshold is defined as a corrected percentile of the χ² distribution (85% in the Stata default). Steps 2 and 3 must be reapplied until there are no more modifications in group M, which will then contain only observations that are not considered outliers. Hence, the ones excluded from the group will be considered multivariate outliers.

Weber (2010) codifies the algorithm proposed in the paper by Billor et al. (2000) in Stata, proposing the command bacon. Next, we will present and discuss an example in which this command is used, whose main advantage is being computationally very fast, even when applied to large datasets.

B) Example: The Command bacon in Stata

Before the specific elaboration of this procedure in Stata, we must install the command bacon by typing findit bacon and clicking on the link st0197 from http://www.stata-journal.com/software/sj10-3. After that, we must click on click here to install. Lastly, going back to the Stata command screen, we can type ssc install moremata and mata: mata mlib index. Having done this, we may apply the command bacon.

To apply this command, let's use the file Bacon.dta, which shows data on the median household income (US$) of 20,000 engineers, their age (years), and the time since they obtained their college degree (years). First of all, we can type the command desc, which allows us to analyze the dataset characteristics. Fig. 11.76 shows this first output. Next, we can type the following command that, based on the algorithm presented, identifies the observations considered multivariate outliers:

bacon income age tgrad, generate(outbacon)

where the term generate(outbacon) causes a new dummy variable, called outbacon, to be generated in the dataset, which takes the value 0 for observations not considered outliers and the value 1 for those considered outliers. This output can be seen in Fig. 11.77.

FIG. 11.76 Description of the Bacon.dta dataset.


FIG. 11.77 Applying the command bacon in Stata.

FIG. 11.78 Observations classified as multivariate outliers.

Through this figure, it is possible to see that four observations are classified as multivariate outliers. Besides, Stata adopts, by default, the 85% percentile of the χ² distribution as the separation threshold between the observations considered outliers and nonoutliers, as previously discussed and highlighted by Weber (2010). This is the reason why the term BACON outliers (p = 0.15) appears in the outputs. This value may be altered according to a criterion established by the researcher; however, we would like to emphasize that the default percentile(0.15) is very adequate for obtaining consistent answers. With the following command, which generates the output seen in Fig. 11.78, we can investigate which observations are classified as outliers:

list if outbacon == 1
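As a sketch of how this threshold could be altered (keeping in mind that percentile() is the option documented by Weber (2010) for this purpose, and that outbacon95 is merely a hypothetical variable name), a stricter nomination could be requested, for example, as follows:

* hypothetical example: use the 95% percentile of the chi-squared distribution as the threshold
bacon income age tgrad, generate(outbacon95) percentile(0.05)

With a higher percentile, fewer observations tend to be nominated as multivariate outliers.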

Even if we are working with three variables, we can construct two-dimensional scatter plots, which allow us to identify the positions of the observations considered outliers in relation to the others. In order to do that, let's type the following commands, which generate the mentioned charts for each pair of variables:

scatter income age, ml(outbacon) note("0 = not outlier, 1 = outlier")
scatter income tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")
scatter age tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")

These three charts can be seen in Figs. 11.79, 11.80, and 11.81.

FIG. 11.79 Variables income and age—Relative position of the observations.


FIG. 11.80 Variables income and tgrad—Relative position of the observations.

FIG. 11.81 Variables age and tgrad—Relative position of the observations.

Despite the fact that outliers have been identified, it is important to mention that the decision about what to do with these observations is entirely up to researchers, who must make it based on their research objectives. As already discussed throughout this chapter, excluding these outliers from the dataset may be an option. However, studying why they became multivariately discrepant can also result in many interesting research outcomes.