Chapter 4
Bivariate Descriptive Statistics

"Numbers rule the world." Plato
4.1 INTRODUCTION
The previous chapter discussed descriptive statistics for a single variable (univariate descriptive statistics). This chapter presents the concepts of descriptive statistics involving two variables (bivariate analysis). The main objective of a bivariate analysis is therefore to study the relationships (associations for qualitative variables and correlations for quantitative variables) between two variables. These relationships can be studied through joint frequency distributions (contingency tables or crossed classification tables, also known as cross tabulation), graphical representations, and summary measures. The bivariate analysis will be studied in two distinct situations:
a) when both variables are qualitative;
b) when both variables are quantitative.
Fig. 4.1 shows the bivariate descriptive statistics that will be studied in this chapter, represented by tables, charts, and summary measures, and presents the following situations:
a) The descriptive statistics used to represent the behavior of two qualitative variables are: joint frequency distribution tables, in this specific case also called contingency tables or crossed classification tables (cross tabulation); charts, such as the perceptual maps resulting from the correspondence analysis technique (more details can be found in Fávero and Belfiore, 2017); and measures of association, such as the chi-square statistic (used for nominal and ordinal qualitative variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (all of them based on chi-square and used for nominal variables), in addition to Spearman's coefficient (for ordinal qualitative variables).
b) In the case of two quantitative variables, we are going to use joint frequency distribution tables, graphical representations such as the scatter plot, and measures of correlation such as covariance and Pearson's correlation coefficient.
4.2 ASSOCIATION BETWEEN TWO QUALITATIVE VARIABLES
The main objective is to assess whether there is a relationship between the qualitative or categorical variables studied, in addition to the level of association between them. This can be done through frequency distribution tables; summary measures, such as the chi-square statistic (used for nominal and ordinal variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (for nominal variables), and Spearman's coefficient (for ordinal variables); and graphical representations, such as the perceptual maps resulting from correspondence analysis, as presented in Fávero and Belfiore (2017).
4.2.1 Joint Frequency Distribution Tables
The simplest way to summarize a set of data resulting from two qualitative variables is through a joint frequency distribution table, in this specific case called a contingency table, a crossed classification table (cross tabulation), or even a correspondence table. It jointly shows the absolute or relative frequencies of the categories of variable X, represented in the rows, and of variable Y, represented in the columns.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00004-5 © 2019 Elsevier Inc. All rights reserved.
PART II Descriptive Statistics
FIG. 4.1 Bivariate descriptive statistics depending on the type of variable. [The figure organizes the bivariate analysis by variable type: for two qualitative variables, tables (contingency tables), charts (perceptual maps), and measures of association (chi-square, Phi coefficient, contingency coefficient, Cramer's V coefficient, Spearman's coefficient); for two quantitative variables, tables (frequency distributions), charts (scatter plot), and measures of correlation (covariance, Pearson's correlation coefficient).]
It is common to add the marginal totals to the contingency table, which correspond to the sums of variable X's rows and of variable Y's columns. We are going to illustrate this analysis through an example based on Bussab and Morettin (2011).

Example 4.1
A study was done with 200 individuals in order to analyze the joint behavior of variable X (Health insurance agency) and variable Y (Level of satisfaction). The contingency table showing the variables' joint absolute frequency distribution, in addition to the marginal totals, is shown in Table 4.E.1. These data are available in the SPSS software in the file HealthInsurance.sav.
TABLE 4.E.1 Joint Absolute Frequency Distribution of the Variables Being Studied

                       Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health         40           16        12         68
Live Life            32           24        16         72
Mena Health          24           32         4         60
Total                96           72        32        200
The study can also be carried out based on the relative frequencies, as discussed in Chapter 3 for univariate problems. Bussab and Morettin (2011) show three ways to express the proportion of each category:
a) in relation to the general total;
b) in relation to the total of each row;
c) in relation to the total of each column.
The choice among these options depends on the objective of the problem. For example, Table 4.E.2 shows the joint relative frequency distribution of the variables being studied in relation to the general total.
TABLE 4.E.2 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the General Total

                       Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health        20%           8%        6%        34%
Live Life           16%          12%        8%        36%
Mena Health         12%          16%        2%        30%
Total               48%          36%       16%       100%
First, we are going to analyze the marginal totals of the rows and columns, which provide the unidimensional distributions of each variable. The marginal totals of the rows correspond to the sum of the relative frequencies of each category of the variable Agency, and the marginal totals of the columns correspond to the sum of each category of the variable Level of satisfaction. Thus, we can conclude that 34% of the individuals are members of Total Health, 36% of Live Life, and 30% of Mena Health. Analogously, we can conclude that 48% of the individuals are dissatisfied with their health insurance agencies, 36% said they were neutral, and only 16% said they were satisfied.
Regarding the joint relative frequency distribution of the variables being studied (the body of the contingency table), we can state, for example, that 20% of the individuals are members of Total Health and are dissatisfied. The same logic is applied to the other cells of the contingency table. In turn, Table 4.E.3 shows the joint relative frequency distribution of the variables being studied in relation to the total of each row.
TABLE 4.E.3 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Row

                       Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health       58.8%        23.5%      17.6%     100%
Live Life          44.4%        33.3%      22.2%     100%
Mena Health        40%          53.3%       6.7%     100%
Total              48%          36%        16%       100%
From Table 4.E.3, we can see that the proportion of Total Health members who are dissatisfied is 58.8% (40/68), of those who are neutral 23.5% (16/68), and of those who are satisfied 17.6% (12/68). The proportions in each row add up to 100%, and the same logic is applied to the other rows. Finally, Table 4.E.4 shows the joint relative frequency distribution of the variables being studied in relation to the total of each column. There, the proportion of dissatisfied individuals who are members of Total Health is 41.7% (40/96), of Live Life 33.3% (32/96), and of Mena Health 25% (24/96). The proportions in each column add up to 100%, and the same logic is applied to the other columns.
TABLE 4.E.4 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Column

                       Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health       41.7%        22.2%      37.5%      34%
Live Life          33.3%        33.3%      50%        36%
Mena Health        25%          44.4%      12.5%      30%
Total             100%         100%       100%       100%
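The three normalizations in Tables 4.E.2, 4.E.3, and 4.E.4 can be reproduced from the absolute frequencies of Table 4.E.1 with a short script. This is an illustrative Python sketch, not part of the original text (which uses only SPSS and Stata):

```python
# Joint absolute frequencies from Table 4.E.1 (agencies in rows,
# satisfaction levels in columns: Dissatisfied, Neutral, Satisfied).
table = {
    "Total Health": [40, 16, 12],
    "Live Life":    [32, 24, 16],
    "Mena Health":  [24, 32, 4],
}

n = sum(sum(row) for row in table.values())                        # general total: 200
col_totals = [sum(row[j] for row in table.values()) for j in range(3)]

for agency, row in table.items():
    row_total = sum(row)
    overall = [100 * x / n for x in row]                  # Table 4.E.2 (general total)
    by_row = [100 * x / row_total for x in row]           # Table 4.E.3 (row total)
    by_col = [100 * x / t for x, t in zip(row, col_totals)]  # Table 4.E.4 (column total)
    print(agency, [round(p, 1) for p in by_row])
```

For Total Health, the printed row percentages are 58.8, 23.5, and 17.6, matching Table 4.E.3.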
Creating Contingency Tables on the SPSS Software
The contingency tables in Example 4.1 will be generated by using SPSS. The use of the images in this chapter has been authorized by the International Business Machines Corporation©.
First, we are going to define the properties of each variable on SPSS. The variables Agency and Level of satisfaction are qualitative but, initially, they are presented as numbers, as shown in the file HealthInsurance_NoLabel.sav. Thus, labels corresponding to each category of both variables must be created, so that:
Labels of the variable Agency: 1 = Total Health; 2 = Live Life; 3 = Mena Health.
Labels of the variable Level of satisfaction, simply called Satisfaction: 1 = Dissatisfied; 2 = Neutral; 3 = Satisfied.
Therefore, we must click on Data → Define Variable Properties… and select the variables that interest us, as seen in Figs. 4.2 and 4.3.
FIG. 4.2 Defining the properties of the variable on SPSS.
FIG. 4.3 Selecting the variables that interest us.
Next, we must click on Continue. Based on Figs. 4.4 and 4.5, note that the variables Agency and Satisfaction were defined as nominal. This definition can also be done in the environment Variable View. The labels must be defined at this moment, as shown in Figs. 4.4 and 4.5. After we click OK, the numbers initially shown in the database are replaced by the respective labels. In the file HealthInsurance.sav, the data have already been labeled.
To create contingency tables (cross tabulation), we are going to click on the menu Analyze → Descriptive Statistics → Crosstabs…, as shown in Fig. 4.6. We are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Next, we must click on Cells…, as shown in Fig. 4.7.
To create contingency tables that represent the joint absolute frequency distribution of the variables observed, the joint relative frequency distribution in relation to the general total, the joint relative frequency distribution in relation to the total of each row, and the joint relative frequency distribution in relation to the total of each column (Tables 4.E.1–4.E.4), we must, from the Crosstabs: Cell Display dialog box (opened after we click on Cells…), select the option Observed in Counts and the options Row, Column, and Total in Percentages, as shown in Fig. 4.8. Finally, we are going to click on Continue and OK. The contingency table (cross tabulation) generated by SPSS is shown in Fig. 4.9. Note that the data generated are exactly the same as those presented in Tables 4.E.1–4.E.4.
FIG. 4.4 Defining the labels of variable Agency.
FIG. 4.5 Defining the labels of variable Satisfaction.
FIG. 4.6 Creating contingency tables (cross tabulation) on SPSS.
FIG. 4.7 Creating a contingency table.
FIG. 4.8 Creating contingency tables from the Crosstabs: Cell Display dialog box.
FIG. 4.9 Cross classification table (cross tabulation) generated by SPSS.
Creating Contingency Tables on the Stata Software
In Chapter 3, we learned how to create frequency distribution tables for a single variable on Stata through the command tabulate, or simply tab. In the case of two or more variables, if the objective is to create univariate frequency distribution tables for each variable being analyzed, we must use the command tab1, followed by the list of variables. The same logic applies to joint frequency distribution tables (contingency tables). To create a contingency table on Stata from the absolute frequencies of the variables being observed, we must use the following syntax:

tabulate variable1* variable2*

or simply:

tab variable1* variable2*

where the terms variable1* and variable2* must be substituted for the names of the respective variables.
If, in addition to the joint absolute frequency distribution of the variables being observed, we want to obtain the joint relative frequency distribution in relation to the total of each row, to the total of each column, and to the general total, we must use the following syntax:

tabulate variable1* variable2*, row column cell

or simply:

tab variable1* variable2*, r co ce

Consider a case with more than two variables being studied, in which the objective is to construct bivariate frequency distribution tables (two-way tables) for all the combinations of variables, two by two. In this case, we must use the command tab2, with the following syntax:

tab2 variables*

where the term variables* should be substituted for the list of variables being considered in the analysis.
Analogously, to obtain both the joint absolute frequency distribution and the joint relative frequency distributions per row, per column, and per general total, we must use the following syntax:

tab2 variables*, r co ce

The contingency tables in Example 4.1 will now be generated by using the Stata software. The data are available in the file HealthInsurance.dta. Hence, to obtain the table of joint absolute frequency distribution, relative frequencies per row, relative frequencies per column, and relative frequencies per general total, the command is:

tab agency satisfaction, r co ce
The results can be seen in Fig. 4.10 and are similar to those presented in Fig. 4.9 (SPSS).
FIG. 4.10 Contingency table constructed on Stata.
4.2.2 Measures of Association
The main measures that represent the association between two qualitative variables are:
a) the chi-square statistic (χ²), used for nominal and ordinal qualitative variables;
b) the Phi coefficient, the contingency coefficient, and Cramer's V coefficient, applied to nominal variables and based on chi-square; and
c) Spearman's coefficient, used for ordinal variables.
4.2.2.1 Chi-Square Statistic
The chi-square statistic (χ²) measures the discrepancy between the observed contingency table and the expected contingency table, under the hypothesis that there is no association between the variables studied. If the observed frequency distribution is exactly equal to the expected frequency distribution, the chi-square statistic is zero; thus, low values of χ² indicate independence between the variables. The χ² statistic is given by:

χ² = Σ_{i=1}^{I} Σ_{j=1}^{J} (O_ij − E_ij)² / E_ij    (4.1)

where:
O_ij: number of observations in the ith category of variable X and in the jth category of variable Y;
E_ij: expected frequency of observations in the ith category of variable X and in the jth category of variable Y;
I: number of categories (rows) of variable X;
J: number of categories (columns) of variable Y.
Example 4.2
Calculate the χ² statistic for Example 4.1.
Solution
Table 4.E.5 shows the observed values of the distribution with the respective relative frequencies in relation to the total of each row. The calculation could also be done in relation to the total of each column, arriving at the same result for the χ² statistic.
TABLE 4.E.5 Observed Values of Each Category With the Respective Proportions in Relation to the Total of Each Row

                       Level of Satisfaction
Agency          Dissatisfied   Neutral      Satisfied    Total
Total Health    40 (58.8%)     16 (23.5%)   12 (17.6%)   68 (100%)
Live Life       32 (44.4%)     24 (33.3%)   16 (22.2%)   72 (100%)
Mena Health     24 (40%)       32 (53.3%)    4 (6.7%)    60 (100%)
Total           96 (48%)       72 (36%)     32 (16%)     200 (100%)
The data in Table 4.E.5 suggest dependence between the variables. If there were no association between them, we would expect, for all three health insurance agencies, a row proportion of 48% in the Dissatisfied column, 36% in the Neutral column, and 16% in the Satisfied column. The calculation of the expected values can be seen in Table 4.E.6. For example, the first cell is calculated as 0.48 × 68 = 32.64.
TABLE 4.E.6 Expected Values in Table 4.E.5, Assuming No Association Between the Variables

                       Level of Satisfaction
Agency          Dissatisfied   Neutral      Satisfied    Total
Total Health    32.6 (48%)     24.5 (36%)   10.9 (16%)   68 (100%)
Live Life       34.6 (48%)     25.9 (36%)   11.5 (16%)   72 (100%)
Mena Health     28.8 (48%)     21.6 (36%)    9.6 (16%)   60 (100%)
Total           96 (48%)       72 (36%)     32 (16%)     200 (100%)
To calculate the χ² statistic, we must apply expression (4.1) to the data in Tables 4.E.5 and 4.E.6. The calculation of each term (O_ij − E_ij)²/E_ij is shown in Table 4.E.7, jointly with the χ² measure resulting from the sum over all the categories.
TABLE 4.E.7 Calculating the χ² Statistic

                       Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied
Total Health        1.66         2.94      0.12
Live Life           0.19         0.14      1.74
Mena Health         0.80         5.01      3.27

Total: χ² = 15.861
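The calculation in Tables 4.E.6 and 4.E.7 can be verified with a short script (an illustrative Python sketch, not part of the original text): the expected frequencies come from the marginal totals, and the χ² statistic is the sum of the squared deviations divided by the expected values, as in expression (4.1).

```python
# Observed frequencies from Table 4.E.1 (rows: agencies; columns:
# Dissatisfied, Neutral, Satisfied).
observed = [
    [40, 16, 12],  # Total Health
    [32, 24, 16],  # Live Life
    [24, 32, 4],   # Mena Health
]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(row[j] for row in observed) for j in range(len(observed[0]))]

# Expected frequency under no association: E_ij = (row total x column total) / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-square statistic, expression (4.1)
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(observed))
    for j in range(len(observed[0]))
)
print(round(chi2, 3))  # 15.861, matching Table 4.E.7
```

The first expected cell is 96 × 68 / 200 = 32.64, matching Table 4.E.6.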
As we are going to study in Chapter 9, which discusses hypothesis tests, the significance level α indicates the probability of rejecting a certain hypothesis when it is true. The P-value, on the other hand, represents the probability associated with the observed sample value, indicating the lowest significance level that would lead to the rejection of the assumed hypothesis. In other words, the P-value is a decreasing index of the reliability of a result: the lower the P-value, the less we can believe in the assumed hypothesis.
In the case of the χ² statistic, whose test presupposes no association between the variables being studied, most statistical software, including SPSS and Stata, calculates the corresponding P-value. Thus, for a confidence level of 95%, if the P-value < 0.05, the hypothesis is rejected and we can state that there is an association between the variables. On the other hand, if the P-value > 0.05, we conclude that the variables are independent. All of these concepts will be studied in more detail in Chapter 9.
Excel calculates the P-value of the χ² statistic through the CHITEST or CHISQ.TEST (Excel 2010 and later versions) functions. To do that, we just need to select the set of cells corresponding to the observed (real) values and the set of cells of the expected values.
Solving the chi-square statistic on the SPSS software
Analogous to Example 4.1, calculating the chi-square statistic (χ²) on SPSS is also done from the menu Analyze → Descriptive Statistics → Crosstabs…. Once again, we are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Initially, to generate the observed values and the expected values in case of no association between the variables (data in Tables 4.E.5 and 4.E.6), we must click on Cells… and select the options Observed and Expected in Counts, from the Crosstabs: Cell Display dialog box (Fig. 4.11).
In the same dialog box, to generate the adjusted standardized residuals, we must select the option Adjusted standardized in Residuals. The results can be seen in Fig. 4.12. To calculate the χ² statistic, in Statistics…, we must select the option Chi-square (Fig. 4.13). Finally, we are going to click on Continue and OK. The result can be seen in Fig. 4.14.
Based on Fig. 4.14, we can see that the value of χ² is 15.861, the same as the one calculated in Table 4.E.7. We can also observe that the lowest significance level that would lead to the rejection of the hypothesis of no association between the variables (the P-value) is 0.003. Since 0.003 < 0.05 (for a confidence level of 95%), the null hypothesis is rejected, which allows us to conclude that there is an association between the variables.
FIG. 4.11 Creating the contingency table with the observed frequencies, the expected frequencies, and the residuals.
FIG. 4.12 Contingency table with the observed values, the expected values, and the residuals, assuming the nonassociation between the variables.
FIG. 4.13 Selecting the χ² statistic.
Solving the χ² statistic on the Stata software
In Section 4.2.1, we learned how to create contingency tables on Stata through the command tabulate, or simply tab. Besides the observed frequencies, this command also provides the expected frequencies through the option expected, or simply exp, as well as the calculation of the χ² statistic through the option chi2, or simply ch. For the data in Example 4.1, available in the file HealthInsurance.dta, to obtain the observed and expected frequency distribution tables jointly with the χ² statistic, we are going to use the following command:

tab agency satisfaction, exp ch

However, the command tab does not allow residuals to be generated in the output. As an alternative, the command tabchi was developed from a tabulation module created by Nicholas J. Cox, which also allows the adjusted standardized residuals to be calculated. In order for this command to be used, we must initially type:
FIG. 4.14 Result of the χ² statistic.
FIG. 4.15 Result of the χ² statistic on Stata.
findit tabchi
and install it from the link tab_chi at http://fmwww.bc.edu/RePEc/bocode/t. After doing this, we can type the following command:

tabchi agency satisfaction, a
The result is shown in Fig. 4.15 and is similar to those presented in Figs. 4.12 and 4.14 on the SPSS software. Note that, differently from the command tab, which requires the option exp so that the expected frequencies can be generated, the command tabchi already provides them automatically.
4.2.2.2 Other Measures of Association Based on Chi-Square
The main measures of association based on the chi-square statistic (χ²) are the Phi coefficient, Cramer's V coefficient, and the contingency coefficient (C), all of them applied to nominal qualitative variables. In general, an association or correlation coefficient is a measure that varies between 0 and 1, assuming the value 0 when there is no relationship between the variables and the value 1 when they are perfectly related. We are going to see how each of the coefficients studied in this section behaves in relation to these characteristics.
a) Phi Coefficient
The Phi coefficient is the simplest measure of association for nominal variables based on χ², and it can be expressed as follows:

Phi = √(χ²/n)    (4.2)

where n is the sample size. In order for Phi to vary only between 0 and 1, the contingency table must have a 2 x 2 dimension.
Example 4.3
In order to offer high-quality services and meet its customers' expectations, Ivanblue, a company in the male fashion industry, is investing in strategies to segment the market. Currently, the company has four stores in Campinas, located in the north, center, south, and east regions of the city, and sells four types of clothes: ties, shirts, polo shirts, and pants. Table 4.E.8 shows the purchase data of 20 customers, namely, the type of clothes and the location of the store. Check if there is an association between the two variables using the Phi coefficient.
TABLE 4.E.8 Purchase Data of 20 Customers

Customer   Clothes      Region
1          Tie          South
2          Polo shirt   North
3          Shirt        South
4          Pants        North
5          Tie          South
6          Polo shirt   Center
7          Polo shirt   East
8          Tie          South
9          Shirt        South
10         Tie          Center
11         Pants        North
12         Pants        Center
13         Tie          Center
14         Polo shirt   East
15         Pants        Center
16         Tie          Center
17         Pants        South
18         Pants        North
19         Polo shirt   East
20         Shirt        Center
Solution
Using the procedure described in the previous section, the value of the chi-square statistic is χ² = 18.214. Therefore:

Phi = √(χ²/n) = √(18.214/20) = 0.954
Since both variables have four categories, in this case the condition 0 ≤ Phi ≤ 1 is not valid, making it difficult to interpret how strong the association is.
b) Contingency coefficient
The contingency coefficient (C), also known as Pearson's contingency coefficient, is another measure of association for nominal variables based on the χ² statistic, represented by the following expression:

C = √(χ²/(n + χ²))    (4.3)

where n is the sample size. The contingency coefficient (C) has the value 0 as its lower limit, indicating that there is no relationship between the variables; however, the upper limit of C varies depending on the number of categories, so that:

0 ≤ C ≤ √((q − 1)/q)    (4.4)
where:

q = min(I, J)    (4.5)

where I is the number of rows and J is the number of columns of the contingency table. When C = √((q − 1)/q), there is a perfect association between the variables; however, this limit never assumes the value 1. Hence, two contingency coefficients can only be compared if both are defined from tables with the same number of rows and columns.
Example 4.4
Calculate the contingency coefficient (C) for the data in Example 4.3.
Solution
We calculate C as follows:

C = √(χ²/(n + χ²)) = √(18.214/(20 + 18.214)) = 0.690

Since the contingency table is 4 x 4 (q = min(4, 4) = 4), the values that C can assume are in the interval:

0 ≤ C ≤ √(3/4) → 0 ≤ C ≤ 0.866

We can conclude that there is an association between the variables.
c) Cramer's V coefficient
Another measure of association for nominal variables based on the χ² statistic is Cramer's V coefficient, calculated by:

V = √(χ²/(n·(q − 1)))    (4.6)

where q = min(I, J), as presented in expression (4.5).
For 2 x 2 contingency tables, expression (4.6) reduces to V = √(χ²/n), which corresponds to the Phi coefficient. Cramer's V coefficient is an alternative to the Phi coefficient and to the contingency coefficient (C), and its value is always limited to the interval [0, 1], regardless of the number of categories in the rows and columns:

0 ≤ V ≤ 1    (4.7)
Value 0 indicates that the variables do not have any kind of association and value 1 shows that they are perfectly associated. Therefore, Cramer’s V coefficient allows us to compare contingency tables that have different dimensions.
Example 4.5
Calculate Cramer's V coefficient for the data in Example 4.3.
Solution

V = √(χ²/(n(q − 1))) = √(18.214/(20 × 3)) = 0.551
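The three coefficients from Examples 4.3, 4.4, and 4.5 follow directly from χ² = 18.214, n = 20, and q = 4. A minimal Python sketch (illustrative; not part of the original text):

```python
from math import sqrt

chi2 = 18.214  # chi-square statistic from Example 4.3
n = 20         # sample size
I, J = 4, 4    # number of categories of Clothes (rows) and Region (columns)
q = min(I, J)

phi = sqrt(chi2 / n)            # Phi coefficient, expression (4.2)
c = sqrt(chi2 / (n + chi2))     # contingency coefficient, expression (4.3)
c_max = sqrt((q - 1) / q)       # upper limit of C, expression (4.4)
v = sqrt(chi2 / (n * (q - 1)))  # Cramer's V coefficient, expression (4.6)

print(round(phi, 3), round(c, 3), round(c_max, 3), round(v, 3))
```

The script reproduces Phi = 0.954, C = 0.690 (with upper limit 0.866), and V = 0.551.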
Since 0 ≤ V ≤ 1, we can conclude that there is an association between the variables; however, it is not considered very strong.
Solution of Examples 4.3, 4.4, and 4.5 (calculation of the Phi, contingency, and Cramer's V coefficients) by using SPSS
In Section 4.2.1, we discussed how to create labels corresponding to the variable categories from the menu Data → Define Variable Properties…. The same procedure must be applied to the data in Table 4.E.8 (we cannot forget to define the variables as nominal). The file Market_Segmentation.sav provides these data already tabulated on SPSS.
FIG. 4.16 Selecting the contingency coefficient and Phi and Cramer’s V coefficients.
FIG. 4.17 Results of the contingency coefficient and Phi and Cramer’s V coefficients.
Similar to the calculation of the χ² statistic, calculating the Phi, contingency, and Cramer's V coefficients on SPSS can also be done from the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Clothes in Row(s) and the variable Region in Column(s). In Statistics…, we are going to select the options Contingency coefficient and Phi and Cramer's V (Fig. 4.16). Note that these coefficients are calculated for nominal variables. The results of the statistics can be seen in Fig. 4.17. For all three coefficients, the P-value of 0.033 (0.033 < 0.05) indicates that there is an association between the variables being studied.
Solution of Examples 4.3 and 4.5 (calculation of the Phi and Cramer's V coefficients) by using Stata
Stata calculates the Phi and Cramer's V coefficients through the command phi. Hence, they are going to be calculated for the data in Example 4.3, available in the file Market_Segmentation.dta.
FIG. 4.18 Calculating the Phi and Cramer’s V coefficients on Stata.
In order for the phi command to be used, initially, we must type:

findit phi

and install it from the link snp3.pkg at http://www.stata.com/stb/stb3/. After doing this, we can type the following command:

phi clothes region
The results can be seen in Fig. 4.18. Note that the Phi coefficient on Stata is called Cohen’s w. Cramer’s V coefficient, on the other hand, is called Cramer’s phi-prime.
4.2.2.3 Spearman's Coefficient
Spearman's coefficient (rsp) is a measure of association between two ordinal qualitative variables. Initially, we must sort the values of variable X and of variable Y in ascending order. After sorting the data, it is possible to assign ranks or rankings, denoted by k (k = 1, …, n). Ranks are assigned separately for each variable: rank 1 is assigned to the smallest value of the variable, rank 2 to the second smallest value, and so on, up to rank n for the highest value. In case of a tie between the values in positions k and k + 1, we must assign the rank k + 1/2 to both observations. Spearman's coefficient can be calculated by using the following expression:

rsp = 1 − (6 Σ_{k=1}^{n} dk²) / (n(n² − 1))    (4.8)

where:
n: number of observations (pairs of values);
dk: difference between the ranks of order k.
Spearman's coefficient varies between −1 and 1. If rsp = 1, all the values of dk are null, indicating that the rankings of variables X and Y are all equal (perfect positive association). The value rsp = −1 is found when Σ_{k=1}^{n} dk² reaches its maximum value of n(n² − 1)/3 (there is an inversion in the variable rankings), indicating a perfect negative association. When rsp = 0, there is no association between variables X and Y. Fig. 4.19 shows a summary of this interpretation, which is similar to that of Pearson's correlation coefficient, to be studied in Section 4.3.3.2.
FIG. 4.19 Interpretation of Spearman's coefficient.
Example 4.6
The coordinator of the Business Administration course is analyzing whether there is any kind of association between the grades of 10 students in two different subjects: Simulation and Finance. The data regarding this problem are presented in Table 4.E.9. Calculate Spearman's coefficient.
TABLE 4.E.9 Grades in the Subjects Simulation and Finance of the 10 Students Being Analyzed

             Grades
Student   Simulation   Finance
1            4.7         6.6
2            6.3         5.1
3            7.5         6.9
4            5.0         7.1
5            4.4         3.5
6            3.7         4.6
7            8.5         6.8
8            8.2         7.5
9            3.5         4.2
10           4.0         3.3
Solution
To calculate Spearman's coefficient, first, we are going to assign ranks to each observation of each variable according to their respective values, as shown in Table 4.E.10.
TABLE 4.E.10 Ranks in the Subjects Simulation and Finance of the 10 Students

             Rankings
Student   Simulation   Finance   dk    dk²
1             5           6      −1     1
2             7           5       2     4
3             8           8       0     0
4             6           9      −3     9
5             4           2       2     4
6             2           4      −2     4
7            10           7       3     9
8             9          10      −1     1
9             1           3      −2     4
10            3           1       2     4
Sum                                    40
Applying expression (4.8), we have:

rsp = 1 − (6 × 40) / (10 × (10² − 1)) = 1 − 240/990 = 0.7576
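The rank assignment and the calculation in expression (4.8) can be reproduced with a short script (an illustrative Python sketch; the original text solves this example only with SPSS and Stata). Ties are handled with average ranks, as described earlier, although the grades in this example have no ties:

```python
def ranks(values):
    """Assign ranks 1..n to values, averaging the ranks in case of ties."""
    ordered = sorted(values)
    return [
        # average of the 1-based positions where v appears in the sorted list
        sum(i + 1 for i, w in enumerate(ordered) if w == v) / ordered.count(v)
        for v in values
    ]

# Grades from Table 4.E.9
simulation = [4.7, 6.3, 7.5, 5.0, 4.4, 3.7, 8.5, 8.2, 3.5, 4.0]
finance = [6.6, 5.1, 6.9, 7.1, 3.5, 4.6, 6.8, 7.5, 4.2, 3.3]

rx, ry = ranks(simulation), ranks(finance)
n = len(simulation)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
rsp = 1 - 6 * d2 / (n * (n ** 2 - 1))           # expression (4.8)
print(round(rsp, 4))  # 0.7576
```

The computed ranks for Simulation are 5, 7, 8, 6, 4, 2, 10, 9, 1, 3, matching Table 4.E.10, and the sum of dk² is 40.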
The value 0.758 indicates a strong positive association between the variables.
Calculating Spearman's coefficient using SPSS software
The file Grades.sav shows the data from Example 4.6 (grades in Table 4.E.9) tabulated in an ordinal scale (defined in the environment Variable View). Similar to the calculation of the χ² statistic and of the Phi, contingency, and Cramer's V coefficients, Spearman's coefficient can also be generated by SPSS from the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Simulation in Row(s) and the variable Finance in Column(s). In Statistics…, we are going to select the option Correlations (Fig. 4.20). We are going to click on Continue and then, finally, on OK. The result of Spearman's coefficient is shown in Fig. 4.21. The P-value 0.011 < 0.05 (under the hypothesis of no association between the variables) indicates that there is a correlation between the grades in Simulation and Finance, with 95% confidence.
Spearman's coefficient can also be calculated from the menu Analyze → Correlate → Bivariate…. We must select the variables that interest us, in addition to Spearman's coefficient, as shown in Fig. 4.22. We are going to click on OK, resulting in Fig. 4.23.
FIG. 4.20 Calculating Spearman's coefficient from the Crosstabs: Statistics dialog box.
FIG. 4.21 Result of Spearman’s coefficient from the Crosstabs: Statistics dialog box.
Calculating Spearman’s coefficient by using Stata software
In Stata, Spearman’s coefficient is calculated using the command spearman. Therefore, for the data in Example 4.6, available in the file Grades.dta, we must type the following command:

spearman simulation finance

The results can be seen in Fig. 4.24.
FIG. 4.22 Calculating Spearman’s coefficient from the Bivariate Correlations dialog box.
FIG. 4.23 Result of Spearman’s coefficient from the Bivariate Correlations dialog box.
FIG. 4.24 Result of Spearman’s coefficient on Stata.
4.3 CORRELATION BETWEEN TWO QUANTITATIVE VARIABLES
In this section, the main objective is to assess whether there is a relationship between the quantitative variables being studied, as well as the level of correlation between them. This can be done through frequency distribution tables, graphical representations such as scatter plots, and measures of correlation such as the covariance and Pearson’s correlation coefficient.
4.3.1 Joint Frequency Distribution Tables
The same procedure presented for qualitative variables can be used to represent the joint distribution of quantitative variables and to analyze the possible relationships between them. As in the study of univariate descriptive statistics, continuous data that do not repeat themselves with a certain frequency can be grouped into class intervals.
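As an illustration of this grouping step, the sketch below builds a joint frequency distribution from two hypothetical quantitative variables (the data, the interval widths, and the helper function interval are ours for illustration, not taken from the book’s files), counting observations per pair of class intervals:

```python
# Joint (bivariate) frequency distribution for two quantitative variables,
# grouping each variable into class intervals before cross-tabulating.
from collections import Counter

# Hypothetical observations (e.g., monthly income and years of education).
income = [1200, 1850, 2400, 3100, 900, 2750, 1600, 3300, 2100, 1450]
years  = [8, 10, 12, 14, 6, 12, 9, 15, 11, 8]

def interval(value, width, start=0):
    # Returns the half-open class interval [lower, lower + width) as a label.
    lower = start + width * ((value - start) // width)
    return f"[{lower}, {lower + width})"

# Count how many observations fall into each pair of class intervals.
joint = Counter((interval(x, 1000), interval(y, 5)) for x, y in zip(income, years))
for cell, freq in sorted(joint.items()):
    print(cell, freq)
```

Each printed cell corresponds to one cell of the joint frequency table; for example, three of the ten observations fall into income class [1000, 2000) combined with education class [5, 10).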
4.3.2 Graphical Representation Through a Scatter Plot
The correlation between two quantitative variables can be represented graphically through a scatter plot, which plots the values of variables X and Y in the Cartesian plane. Therefore, a scatter plot allows us to assess: a) whether there is any relationship between the variables being studied; b) the type of relationship between the two variables, that is, the direction in which variable Y increases or decreases as X changes; c) the level of relationship between the variables; d) the nature of the relationship (linear, exponential, among others). Fig. 4.25 shows a scatter plot in which the relationship between variables X and Y is strong positive linear, that is, variations in Y are directly proportional to variations in X: the level of relationship is strong and its nature is linear. If all the points lie on a straight line, the relationship is perfect positive linear, as shown in Fig. 4.26. Figs. 4.27 and 4.28, on the other hand, show scatter plots in which the relationship between variables X and Y is strong negative linear and perfect negative linear, respectively.

FIG. 4.25 Strong positive linear relationship.
FIG. 4.26 Perfect positive linear relationship.
FIG. 4.27 Strong negative linear relationship.
FIG. 4.28 Perfect negative linear relationship.
FIG. 4.29 There is no relationship between variables X and Y.
Finally, there may be no relationship at all between variables X and Y, as shown in Fig. 4.29.

Constructing a scatter plot on SPSS
Example 4.7
Let us open the file Income_Education.sav on SPSS. The objective is to analyze the correlation between the variables Family Income and Years of Education through a scatter plot. In order to do that, we are going to click on Graphs → Legacy Dialogs → Scatter/Dot… (Fig. 4.30). In the Scatter/Dot window in Fig. 4.31, we are going to select the type of chart (Simple Scatter). Clicking on Define, the Simple Scatterplot dialog box will open, as shown in Fig. 4.32. We are going to select the variable FamilyIncome for the Y-axis and the variable YearsofEducation for the X-axis. Next, we are going to click on OK. The scatter plot created is shown in Fig. 4.33. Based on Fig. 4.33, we can see a strong positive correlation between the variables Family Income and Years of Education. Therefore, the higher the number of years of education, the higher the family income tends to be, even though this alone does not establish a cause and effect relationship.
FIG. 4.30 Constructing a scatter plot on SPSS.
FIG. 4.31 Selecting the type of chart.
The scatter plot can also be created in Excel by selecting the option Scatter.

Constructing a scatter plot on Stata
The data from Example 4.7 are also available on Stata in the file Income_Education.dta. The variables being studied are called income and education. The scatter plot on Stata is created using the command twoway scatter (or simply tw sc) followed by the variables we are interested in. Thus, to analyze the correlation between the variables Family Income and Years of Education through a scatter plot on Stata, we must type the following command:

tw sc income education
The resulting scatter plot is shown in Fig. 4.34.
FIG. 4.32 Simple Scatterplot dialog box.
FIG. 4.33 Scatter plot of the variables Family Income and Years of Education.
FIG. 4.34 Scatter plot on Stata.
4.3.3 Measures of Correlation
The main measures of correlation, used for quantitative variables, are the covariance and Pearson’s correlation coefficient.
4.3.3.1 Covariance

Covariance measures the joint variation between two quantitative variables X and Y, and it is calculated by using the following expression:

cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}    (4.9)

where:
X_i: ith value of X;
Y_i: ith value of Y;
\bar{X}: mean of the values of X_i;
\bar{Y}: mean of the values of Y_i;
n: sample size.

One of the limitations of the covariance is that the measure depends on the sample size, which may lead to a poor estimate in the case of small samples. Pearson’s correlation coefficient is an alternative that avoids this problem.

Example 4.8
Once again, consider the data in Example 4.7 regarding the variables Family Income and Years of Education. The data are also available in Excel in the file Income_Education.xls. Calculate the covariance between the two variables.

Solution
Applying expression (4.9), we have:

cov(X, Y) = \frac{(7.6 - 7.08)(1961 - 1856.22) + \cdots + (5.4 - 7.08)(775 - 1856.22)}{95} = \frac{72{,}326.93}{95} = 761.336

The covariance can be calculated in Excel by using the COVARIANCE.S (sample) function. In the following section, we also discuss how the covariance can be calculated on SPSS, jointly with Pearson’s correlation coefficient. SPSS uses the same expression presented in this section.
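Since the full Income_Education data set is not reproduced in the text, expression (4.9) is sketched below in Python and applied to the Simulation and Finance grades from Table 4.E.9 (so the resulting value, about 2.197, refers to those grades, not to Example 4.8):

```python
# Sample covariance, expression (4.9):
# cov(X, Y) = sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1).
simulation = [4.7, 6.3, 7.5, 5.0, 4.4, 3.7, 8.5, 8.2, 3.5, 4.0]
finance    = [6.6, 5.1, 6.9, 7.1, 3.5, 4.6, 6.8, 7.5, 4.2, 3.3]

def sample_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sum of cross-products of the deviations from the means, over n - 1.
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

print(round(sample_cov(simulation, finance), 4))  # ≈ 2.1969
```

On Python 3.10+, the standard-library function statistics.covariance should return the same (sample) value, since it also divides by n − 1.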
FIG. 4.35 Interpretation of Pearson’s correlation coefficient.
4.3.3.2 Pearson’s Correlation Coefficient

Pearson’s correlation coefficient (r) is a measure that varies between −1 and 1. Its sign indicates the type of linear relationship between the two variables analyzed (the direction in which variable Y increases or decreases as X changes), and the closer the coefficient is to the extreme values, the stronger the correlation between them. Therefore:
– If r is positive, there is a directly proportional relationship between the variables; if r = 1, we have a perfect positive linear correlation.
– If r is negative, there is an inversely proportional relationship between the variables; if r = −1, we have a perfect negative linear correlation.
– If r is null, there is no correlation between the variables.

Fig. 4.35 shows a summary of the interpretation of Pearson’s correlation coefficient. Pearson’s correlation coefficient (r) can be calculated as the ratio between the covariance of the two variables and the product of the standard deviations (S) of each one of them:

r = \frac{cov(X, Y)}{S_X S_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1) \, S_X S_Y}    (4.10)

Since S_X = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)} and S_Y = \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2 / (n - 1)}, as we studied in Chapter 3, expression (4.10) becomes:

r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}    (4.11)

In Chapter 12, we are going to use Pearson’s correlation coefficient extensively when studying factor analysis.

Example 4.9
Once again, open the file Income_Education.xls and calculate Pearson’s correlation coefficient between the two variables.

Solution
Applying expression (4.10), we have:

r = \frac{cov(X, Y)}{S_X S_Y} = \frac{761.336}{970.774 \times 1.009} = 0.777

This calculation could also be done by using expression (4.11), in which the sample size does not appear explicitly. The result indicates a strong positive correlation between the variables Family Income and Years of Education.
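Analogously, expression (4.11) can be sketched in Python; again we reuse the grades from Table 4.E.9 as sample data, so the value obtained (about 0.7255) differs from the 0.777 of Example 4.9, which refers to the Income_Education data:

```python
# Pearson's correlation coefficient via expression (4.11), which uses only
# the centered sums (the n - 1 factors cancel out between numerator and
# denominator).
from math import sqrt

simulation = [4.7, 6.3, 7.5, 5.0, 4.4, 3.7, 8.5, 8.2, 3.5, 4.0]
finance    = [6.6, 5.1, 6.9, 7.1, 3.5, 4.6, 6.8, 7.5, 4.2, 3.3]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # cross-products
    sxx = sum((xi - mx) ** 2 for xi in x)                     # sum of squares of X
    syy = sum((yi - my) ** 2 for yi in y)                     # sum of squares of Y
    return sxy / sqrt(sxx * syy)

print(round(pearson_r(simulation, finance), 4))  # ≈ 0.7255
```

On Python 3.10+, statistics.correlation should give the same result; dividing sample_cov from the previous sketch by the product of the two sample standard deviations, as in expression (4.10), is an equivalent route.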
FIG. 4.36 Bivariate Correlations dialog box.
Excel also calculates Pearson’s correlation coefficient through the PEARSON function.

Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson’s correlation coefficient) on SPSS
Once again, open the file Income_Education.sav. To calculate the covariance and Pearson’s correlation coefficient on SPSS, we are going to click on Analyze → Correlate → Bivariate…. The Bivariate Correlations window will open. We are going to select the variables Family Income and Years of Education, in addition to Pearson’s correlation coefficient, as shown in Fig. 4.36. In Options…, we must select the option Cross-product deviations and covariances, according to Fig. 4.37. We are going to click on Continue and then on OK. The results of the statistics are presented in Fig. 4.38.

FIG. 4.37 Selecting the covariance statistic.
FIG. 4.38 Results of the covariance and of Pearson’s correlation coefficient on SPSS.
FIG. 4.39 Calculating Pearson’s correlation coefficient on Stata.
FIG. 4.40 Calculating the covariance on Stata.
Analogous to Spearman’s coefficient, Pearson’s correlation coefficient can also be generated on SPSS from the menu Analyze → Descriptive Statistics → Crosstabs… (option Correlations in the Statistics… dialog box).

Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson’s correlation coefficient) on Stata
To calculate Pearson’s correlation coefficient on Stata, we must use the command correlate, or simply corr, followed by the list of variables we are interested in. The result is the correlation matrix between the respective variables. Once again, open the file Income_Education.dta. Thus, for the data in this file, we can type the following command:

corr income education
The result can be seen in Fig. 4.39. To calculate the covariance, we must add the option covariance, or only cov, at the end of the command correlate (or simply corr). Thus, to generate Fig. 4.40, we must type the following command:

corr income education, cov
4.4 FINAL REMARKS
This chapter presented the main concepts of descriptive statistics with greater focus on the study of the relationship between two variables (bivariate analysis). We studied the relationships between two qualitative variables (associations) and between two quantitative variables (correlations). For each situation, several measures, tables, and charts were presented, which allow us to have a better understanding of the data behavior. Fig. 4.1 summarizes this information.
The construction and interpretation of frequency distributions, graphical representations, in addition to summary measures (measures of position or location and measures of dispersion or variability), allow the researcher to have a better understanding and visualization of the data behavior for two variables simultaneously. More advanced techniques can be applied in the future to the same set of data, so that researchers can go deeper in their studies on bivariate analysis, aiming at improving the quality of the decision making process.
4.5 EXERCISES
1) Which descriptive statistics can be used (and in which situations) to represent the behavior of two qualitative variables simultaneously?
2) And to represent the behavior of two quantitative variables?
3) In what situations should we use contingency tables?
4) What are the differences between the chi-square statistic (χ²), the Phi coefficient, the contingency coefficient (C), Cramer’s V coefficient, and Spearman’s coefficient?
5) What are the main summary measures to represent the data behavior between two quantitative variables? Describe each one of them.
6) Aiming at identifying the behavior of customers who are in default regarding their payments, a survey with information on the age and level of default of the respondents was carried out. The objective is to determine if there is an association between the variables. Based on the files Default.sav and Default.dta, we would like you to:
a) Create the joint frequency distribution tables for the variables age_group and default (absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each row, relative frequencies in relation to the total of each column, and the expected frequencies).
b) Determine the percentage of individuals who are between 31 and 40 years of age.
c) Determine the percentage of individuals who are heavily indebted.
d) Determine the percentage of respondents who are 20 years old or younger and do not have debts.
e) Determine, among the individuals who are older than 60, the percentage of those who are a little indebted.
f) Determine, among the individuals who are relatively indebted, the percentage of those who are between 41 and 50 years old.
g) Verify if there are indications of dependence between the variables.
h) Confirm the previous item using the χ² statistic.
i) Calculate the Phi, contingency, and Cramer’s V coefficients, confirming whether there is an association between the variables or not.
7) The files Motivation_Companies.sav and Motivation_Companies.dta contain a database with the variables Company and Level of Motivation (Motivation), obtained through a survey carried out with 250 employees (50 respondents for each one of the 5 companies surveyed), aiming at assessing the employees’ level of motivation in relation to the companies, all of them considered large firms. Hence, we would like you to:
a) Create the contingency tables of absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each row, relative frequencies in relation to the total of each column, and the expected frequencies.
b) Calculate the percentage of respondents who are very demotivated.
c) Calculate the percentage of respondents who work for Company A and are very demotivated.
d) Calculate the percentage of motivated respondents in Company D.
e) Calculate the percentage of respondents in Company C who are a little motivated.
f) Among the respondents who are very motivated, determine the percentage of those who work for Company B.
g) Verify if there are indications of dependence between the variables.
h) Confirm the previous item using the χ² statistic.
i) Calculate the Phi, contingency, and Cramer’s V coefficients, confirming whether there is an association between the variables or not.
8) The files Students_Evaluation.sav and Students_Evaluation.dta show the grades, from 0 to 10, of 100 students from a public university in the following subjects: Operational Research, Statistics, Operations Management, and Finance. Check whether there is a correlation between the following pairs of variables by constructing the scatter plot and calculating Pearson’s correlation coefficient:
a) Operational Research and Statistics;
b) Operations Management and Finance;
c) Operational Research and Operations Management.
9) The files Brazilian_Supermarkets.sav and Brazilian_Supermarkets.dta show revenue data and the number of stores of the 20 largest Brazilian supermarket chains in a given year (source: ABRAS, Brazilian Association of Supermarkets). We would like you to:
a) Create the scatter plot for the variables revenue × number of stores.
b) Calculate Pearson’s correlation coefficient between the two variables.
c) Exclude the four largest supermarket chains in terms of revenue, as well as the chain AM/PM Food and Beverages Ltd., and once again create the scatter plot.
d) Once again, calculate Pearson’s correlation coefficient between the two variables being studied.