Computational Statistics & Data Analysis 44 (2003) 211 – 222 www.elsevier.com/locate/csda
A measure of association for complex data
Seung-Chun Lee (a,1), Moon Yul Huh (b,∗)
(a) Department of Statistics, Hanshin University, Osan, Kyunggi-Do, South Korea
(b) Department of Statistics, Sung Kyun Kwan University, Myungryun-Dong, Chongro-Ku, 110-745, Seoul, South Korea
Received 5 August 2002; received in revised form 9 December 2002; accepted 9 December 2002
Abstract
A measure of association for complex data types is proposed based on the measure of departure from independence using the p-value of a statistical independence test. The measure is numerically shown to be comparable to Silvey’s general measure of association. It is demonstrated with real data sets of complex data types that the measure works efficiently for the decision tree and the logistic regression at the initial stage of variable selection.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Measure of association; Independence; Complex data; p-value; Rank correlation coefficient; Kruskal-Wallis test; Pearson’s $X^2$ test
∗ Corresponding author. Tel.: +82-2-760-0463; fax: +82-2-764-0622. E-mail address: [email protected] (M.Y. Huh).
1 Supported by Hanshin University research fund, 2003.
© 2003 Elsevier B.V. All rights reserved. 0167-9473/03/$ - see front matter. doi:10.1016/S0167-9473(03)00031-8

1. Introduction
Measures of association are among the topics that receive the most attention from statisticians, and a number of studies have been devoted to them. The correlation coefficient, the rank correlation coefficient and Kendall’s tau (Kendall, 1938) may be the most notable association measures. Other types of measures of association are defined on contingency tables. These include Cramér’s V (Cramér, 1946), Goodman and Kruskal’s tau (Goodman and Kruskal, 1954), Cohen’s kappa (Cohen, 1960), and Yule’s Q (Yule, 1912). However, it should be noted that these association measures are limited to the cases when the two random variables are of the same type: continuous
or discrete. They are not intended to measure the association between variables of different types, where one is continuous and the other is discrete. It is common today, however, that the data sets used in statistical data analysis, especially in the field of data mining, are becoming diverse, and the problem is to understand the relationship among variables of complex data types. Silvey (1964) proposed a general measure of association between random variables X and Y, defined as follows:
$$\Delta = \iint_{\{(x,y):\ \phi(x,y) > 1\}} [p(x,y) - p(x)p(y)]\,dx\,dy,$$
where $\phi(x,y) = p(x,y)/(p(x)p(y))$, and $p(x,y)$, $p(x)$ and $p(y)$ are the joint and marginal densities of X and Y, respectively. $\Delta$ is basically an extension of the concept of the measure of information introduced by Shannon (1948) in the context of communication theory. The measure is distribution-dependent, and the relevant densities must be estimated in order to compute it, which in practice is another serious problem. Thus the measure is not used as a measure of association in practice.
For a sample version of $\Delta$, we propose $\delta_T(\Delta) = 1 - p_T(\Delta)$, where $p_T(\Delta)$ denotes the p-value of an independence test T when Silvey’s measure of association between the bivariate random variables is $\Delta$. A larger value of $\delta_T(\Delta)$ implies a higher degree of dependence, and hence a higher degree of association. This can be considered a measure of departure from independence (MDI). For $\delta_T(\Delta)$ to be a reasonable measure of association, it should satisfy the following properties.
[Properties of $\delta_T(\Delta)$]
• The measure should be a non-decreasing function of $\Delta$.
• The measure should satisfy the monotonic property across the different tests applied, in the following sense. For $\Delta_2 > \Delta_1$ and for two tests $T_1$ and $T_2$, the following probability is monotonically increasing in $\Delta_2 - \Delta_1$:
$$\Pr[\delta_{T_1}(\Delta_1) \le \delta_{T_2}(\Delta_2)]. \tag{1}$$
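To make Silvey’s definition concrete, the following is a minimal sketch of a discrete analogue, in which the double integral is replaced by a sum over the cells of a joint probability table. The function name `silvey_delta` and the two example tables are illustrative assumptions, not part of the paper.

```python
def silvey_delta(joint):
    """Discrete analogue of Silvey's Delta (assumption: sums replace integrals).

    joint[i][j] = Pr(X = i, Y = j); the entries are expected to sum to 1.
    """
    row = [sum(r) for r in joint]            # marginal probabilities of X
    col = [sum(c) for c in zip(*joint)]      # marginal probabilities of Y
    total = 0.0
    for i, r in enumerate(joint):
        for j, p_ij in enumerate(r):
            product = row[i] * col[j]
            if p_ij > product:               # cells where the ratio phi exceeds 1
                total += p_ij - product
    return total


# Hypothetical 2 x 2 examples: exact independence and perfect dependence.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(silvey_delta(independent))  # 0.0
print(silvey_delta(dependent))    # 0.5
```

Under independence no cell exceeds the product of its marginals, so the measure is zero; any excess mass on some cells raises it.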
For $\delta_T(\Delta)$ to be a measure of association for complex data types, we propose using the following three tests for $\delta_T(\Delta)$, depending upon the types of the variables.
• continuous–continuous variables: Spearman’s rank correlation test.
• discrete–discrete variables: Pearson’s $X^2$ test.
• continuous–discrete variables: Kruskal-Wallis test.
Our choice of the above three tests is based on the fact that these tests share common properties in the sense that they can all be obtained from the component estimates of Pearson’s $\phi^2$ distance measure; see Eubank et al. (1987). In the following section, we numerically show that $\delta_T(\Delta)$ is a non-decreasing function of $\Delta$ for our choice of three tests, and also show that the measure satisfies the monotonic property for the three tests as defined in Eq. (1). Then we present two real-world data sets with complex data types, and apply the measure to arrange the variables in the
Fig. 1. $\Delta$ (vertical axis) vs. $\rho$ (horizontal axis) for the three different configurations of the two random variables: continuous-continuous, continuous-discrete and discrete-discrete.
order of their strength of association with the target variable for the problems of logistic regression and decision tree.
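The choice of test by variable type can be sketched as follows. This is only an illustration under stated assumptions, not the authors’ implementation: the function names (`mdi_spearman`, `mdi_kruskal`, `mdi_pearson`, `chi2_sf`, `ranks`) are hypothetical, the Spearman p-value uses a two-sided large-sample normal approximation, and the other two tests use their usual chi-square approximations. Only the Python standard library is used.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1, using
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1) with z = x / 2."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    a, q = (0.5, math.erfc(math.sqrt(z))) if df % 2 else (1.0, math.exp(-z))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out, i = [0.0] * len(values), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out


def mdi_spearman(x, y):
    """continuous-continuous: 1 - p-value of Spearman's rank correlation test."""
    n, rx, ry = len(x), ranks(x), ranks(y)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    rho = sxy / math.sqrt(sxx * syy)
    # Assumption: two-sided large-sample normal approximation to the p-value.
    return 1.0 - math.erfc(math.sqrt(n - 1) * abs(rho) / math.sqrt(2.0))


def mdi_kruskal(x, g):
    """continuous-discrete: 1 - p-value of the Kruskal-Wallis test."""
    n, rank_sum, size = len(x), Counter(), Counter()
    for r, label in zip(ranks(x), g):
        rank_sum[label] += r
        size[label] += 1
    h = 12.0 / (n * (n + 1)) * sum(rank_sum[k] ** 2 / size[k] for k in size) - 3.0 * (n + 1)
    h /= 1.0 - sum(t ** 3 - t for t in Counter(x).values()) / float(n ** 3 - n)  # tie correction
    return 1.0 - chi2_sf(h, len(size) - 1)


def mdi_pearson(a, b):
    """discrete-discrete: 1 - p-value of Pearson's X^2 test."""
    n, cell, ra, cb = len(a), Counter(zip(a, b)), Counter(a), Counter(b)
    x2 = sum((cell[(i, j)] - ra[i] * cb[j] / n) ** 2 / (ra[i] * cb[j] / n)
             for i in ra for j in cb)
    return 1.0 - chi2_sf(x2, (len(ra) - 1) * (len(cb) - 1))


# Hypothetical toy data: a monotone pair, a grouped pair, and a balanced table.
x = list(range(20))
print(mdi_spearman(x, [v * v for v in x]))
print(mdi_kruskal(x, [i // 5 for i in x]))
print(mdi_pearson([0, 0, 1, 1] * 5, [0, 1, 0, 1] * 5))
```

Because all three functions return a value on the same 0 to 1 scale, variables of different types can be ranked against one target in a single list.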
2. Simulation study
For our simulation study, we use the bivariate normal distribution with correlation coefficient $\rho$. Since Silvey (1964) has shown that $\Delta$ is a nondecreasing function of $|\rho|$ in the bivariate normal case, we can safely use $\rho$ instead of $\Delta$ as a measure of association. We denote the three MDI’s from Spearman’s rank correlation test, the Kruskal-Wallis test and Pearson’s $X^2$ test by $\delta_S$, $\delta_K$ and $\delta_C$, respectively. We take a random sample of size 100, $(X_i, Y_i)$, $i = 1, 2, \ldots, 100$, from a bivariate normal distribution with $E(X_i) = E(Y_i) = 0$, $\mathrm{Var}(X_i) = \mathrm{Var}(Y_i) = 1$ and $\mathrm{Corr}(X_i, Y_i) = \rho$. For $\delta_K$, we divide the normal random variable Y into 4 intervals with equal probabilities so that the variable X is continuous and Y is discrete. Similarly, for $\delta_C$, we divide both variables into 4 intervals with equal probabilities. We consider 10 different values of $\rho$ = 0, 0.1, ..., 0.9. For each value of $\rho$, we use 1,000 replications to compute the relevant statistics.
It could be expected that discretizing a continuous variable would incur a loss of the information inherent in the variable. Hence the degree of association between the two variables will be weakened by discretizing at least one of the two variables. To demonstrate this effect, we calculate Silvey’s $\Delta$ for different values of $\rho$ for the three different configurations of the two random variables: continuous–continuous, continuous–discrete and discrete–discrete. The result is shown in Fig. 1, and it shows that discretization weakens the association between the two variables. Next, in Fig. 2, empirical distributions of the three MDI’s are presented for the 10 different values of $\rho$. Here we apply a linearizing transformation of MDI so that we
Fig. 2. Boxplots of MDI’s and Cramér’s V vs. $\rho$ for the three tests and Cramér’s V, respectively. (a) Spearman’s rank correlation test ($\delta_S^t$), (b) Kruskal-Wallis test ($\delta_K^t$), (c) Pearson’s $X^2$ test ($\delta_C^t$) and (d) Cramér’s V.
can better understand the plot. The transformation is as follows:
$$\delta^t = \frac{1}{\sqrt{n-1}}\,\Phi^{-1}\!\left(\frac{1+\delta}{2}\right).$$
This transformation is motivated by the result that, for large n, the p-value of the rank correlation test is approximated by $2 \times (1 - \Phi(\sqrt{n-1}\,|\tilde{\rho}_n|))$, where $\Phi(\cdot)$ is the distribution function of the standard normal distribution and $\tilde{\rho}_n$ denotes Spearman’s rank correlation coefficient for sample size n. Fig. 2 gives 4 plots for $\delta_S^t$, $\delta_K^t$, $\delta_C^t$, and Cramér’s V, respectively. The distribution of Cramér’s V is provided for comparison with the distribution of $\delta_C^t$. Fig. 2 shows that the MDI’s are monotonically increasing functions of $\rho$. In particular, Fig. 2(a) shows that the relationship between $\delta_S^t$ and $\rho$ is nearly linear. However, Fig. 2(b) and (c) do not show linearity. This is because, in computing $\delta_K$ and $\delta_C$, we divided the continuous data into 4 categories, which in turn incurs a loss of the association inherent in the data. The result is consistent with the one given in Fig. 1.
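The effect of the linearizing transformation can be checked numerically. The sketch below assumes the large-sample approximation quoted above holds exactly; the function names `linearize` and `spearman_mdi_large_n` are hypothetical. Under that assumption the transformed MDI simply recovers the absolute rank correlation, which is why the Spearman panel looks nearly linear.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def linearize(delta, n):
    """Transformed MDI: Phi^{-1}((1 + delta) / 2) / sqrt(n - 1).

    inv_cdf requires an argument strictly between 0 and 1, so delta must be
    below 1; an MDI that has rounded to exactly 1 cannot be transformed.
    """
    return STD_NORMAL.inv_cdf((1.0 + delta) / 2.0) / math.sqrt(n - 1)


def spearman_mdi_large_n(rho_tilde, n):
    """MDI implied by the large-sample p-value 2 * (1 - Phi(sqrt(n-1) |rho|))."""
    p_value = 2.0 * (1.0 - STD_NORMAL.cdf(math.sqrt(n - 1) * abs(rho_tilde)))
    return 1.0 - p_value


n = 100
for rho_tilde in (0.1, 0.2, 0.3):
    delta = spearman_mdi_large_n(rho_tilde, n)
    # The last column reproduces rho_tilde up to floating-point error.
    print(rho_tilde, delta, linearize(delta, n))
```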
Table 1
Simulated values of $\Pr[\delta_S(\rho_1) < \delta_K(\rho_2)]$ with 1000 iterations (rows: $\rho_1$; columns: $\rho_2$)

ρ1\ρ2   0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.0     0.521  0.590  0.765  0.916  0.968  0.998  1.000  1.000  1.000  1.000
0.1     0.393  0.462  0.644  0.842  0.951  0.990  0.999  1.000  1.000  1.000
0.2     0.172  0.208  0.400  0.623  0.829  0.957  0.994  1.000  1.000  1.000
0.3     0.028  0.053  0.149  0.308  0.603  0.846  0.967  0.999  1.000  1.000
0.4     0.007  0.009  0.034  0.111  0.307  0.562  0.853  0.976  1.000  1.000
0.5     0.000  0.001  0.004  0.020  0.074  0.251  0.558  0.870  0.994  1.000
0.6     0.000  0.000  0.000  0.002  0.009  0.057  0.239  0.602  0.935  1.000
0.7     0.000  0.000  0.000  0.000  0.000  0.002  0.022  0.160  0.618  0.978
0.8     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.005  0.107  0.769
0.9     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.001  0.424
Table 2
Simulated values of $\Pr[\delta_S(\rho_1) < \delta_C(\rho_2)]$ with 1000 iterations (rows: $\rho_1$; columns: $\rho_2$)

ρ1\ρ2   0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.0     0.515  0.566  0.708  0.820  0.936  0.989  1.000  1.000  1.000  1.000
0.1     0.381  0.453  0.571  0.714  0.865  0.955  0.993  1.000  1.000  1.000
0.2     0.177  0.221  0.299  0.479  0.685  0.860  0.976  0.997  1.000  1.000
0.3     0.036  0.050  0.100  0.194  0.388  0.612  0.891  0.980  1.000  1.000
0.4     0.002  0.011  0.024  0.062  0.178  0.323  0.675  0.924  0.997  1.000
0.5     0.000  0.000  0.002  0.005  0.029  0.103  0.356  0.722  0.977  1.000
0.6     0.000  0.000  0.000  0.000  0.000  0.012  0.108  0.420  0.855  0.999
0.7     0.000  0.000  0.000  0.000  0.000  0.000  0.017  0.125  0.535  0.990
0.8     0.000  0.000  0.000  0.000  0.000  0.000  0.002  0.017  0.182  0.945
0.9     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.004  0.076  0.885
Now we present the simulation result showing that the MDI’s of the three tests have the monotonicity property described in Eq. (1). Table 1 gives the empirical results of $\Pr[\delta_S(\rho_1) \le \delta_K(\rho_2)]$ for all combinations of $\rho_1$ and $\rho_2$ when each parameter takes one of the values 0.0, 0.1, ..., 0.9. Ideally, the values of the diagonal elements should be 0.5, but they are consistently less than 0.5 except for the case of $\rho_1 = \rho_2 = 0$. This is due to the discretization as described above. The upper triangular elements should be greater than 0.5 and should approach 1 as the position moves towards the upper-right corner, that is, as $\rho_2 - \rho_1$ gets larger. The lower triangular elements should be less than 0.5, and should approach 0 as the position moves towards the lower-left corner. Table 1 shows all of these patterns. If we replace the lower triangular elements by 1 minus the corresponding upper triangular elements, the table should be symmetric. Because of the discretization, however, the table is not exactly symmetric. Tables 2 and 3 show similar patterns. Although the values in the tables deviate somewhat from the ideal values, the general pattern shows the monotonic property of the MDI’s of the three tests.
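One cell of such a table could be estimated along the following lines. This is a rough sketch under several assumptions that the paper does not spell out: the two MDI’s are computed from independent samples, the Spearman p-value uses the large-sample normal approximation, the Kruskal-Wallis p-value uses the chi-square approximation with 3 degrees of freedom, and only 200 replications are run by default to keep the example fast. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

# Quartiles of N(0, 1): cut points giving 4 intervals with equal probabilities.
CUTS = [NormalDist().inv_cdf(q) for q in (0.25, 0.5, 0.75)]


def sample(n, rho, rng):
    """n draws from a standard bivariate normal with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x, e = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho * rho) * e)
    return xs, ys


def ranks(v):
    """Ranks 1..n (continuous data, so ties are not handled)."""
    r = [0] * len(v)
    for pos, i in enumerate(sorted(range(len(v)), key=v.__getitem__)):
        r[i] = pos + 1
    return r


def mdi_spearman(xs, ys):
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    rho_s = 1 - 6 * d2 / (n * (n * n - 1))
    # 1 - p, with p = 2 * (1 - Phi(sqrt(n - 1) * |rho_s|))
    return 1 - math.erfc(math.sqrt(n - 1) * abs(rho_s) / math.sqrt(2))


def mdi_kruskal(xs, ys):
    n = len(xs)
    groups = [sum(y > c for c in CUTS) for y in ys]    # quartile class 0..3 of Y
    rank_sum, size = [0] * 4, [0] * 4
    for r, g in zip(ranks(xs), groups):
        rank_sum[g] += r
        size[g] += 1
    h = 12 / (n * (n + 1)) * sum(s * s / m for s, m in zip(rank_sum, size) if m) - 3 * (n + 1)
    h = max(h, 0.0)
    # Chi-square upper tail with 3 df (assumes no quartile class is empty).
    p = math.erfc(math.sqrt(h / 2)) + math.sqrt(2 * h / math.pi) * math.exp(-h / 2)
    return 1 - p


def prob_s_le_k(rho1, rho2, n=100, reps=200, seed=1):
    """Monte Carlo estimate of Pr[delta_S(rho1) <= delta_K(rho2)]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        hits += mdi_spearman(*sample(n, rho1, rng)) <= mdi_kruskal(*sample(n, rho2, rng))
    return hits / reps


print(prob_s_le_k(0.0, 0.6), prob_s_le_k(0.6, 0.0))
```

The estimates will not match the published tables digit for digit, since the approximations and the random number stream differ, but the corner cells should behave in the same way.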
Table 3
Simulated values of $\Pr[\delta_K(\rho_1) < \delta_C(\rho_2)]$ with 1000 iterations (rows: $\rho_1$; columns: $\rho_2$)

ρ1\ρ2   0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.0     0.489  0.552  0.677  0.829  0.938  0.984  0.999  1.000  1.000  1.000
0.1     0.425  0.462  0.610  0.751  0.917  0.975  0.999  1.000  1.000  1.000
0.2     0.223  0.288  0.404  0.549  0.770  0.911  0.993  0.998  1.000  1.000
0.3     0.091  0.117  0.209  0.352  0.593  0.780  0.957  0.998  1.000  1.000
0.4     0.021  0.032  0.069  0.146  0.308  0.531  0.841  0.976  0.998  1.000
0.5     0.004  0.003  0.010  0.039  0.120  0.256  0.618  0.882  0.990  1.000
0.6     0.000  0.000  0.001  0.003  0.026  0.076  0.286  0.653  0.952  1.000
0.7     0.000  0.000  0.000  0.000  0.001  0.009  0.072  0.312  0.797  1.000
0.8     0.000  0.000  0.000  0.000  0.000  0.000  0.002  0.066  0.447  0.991
0.9     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.007  0.116  0.908
3. Real world examples
3.1. Example 1
In a regression analysis, if there are many candidates for the independent variables, it is common procedure to select, at the initial stage, the independent variables having relatively high correlations with the dependent variable. It is not clear, however, what criterion to use when the types of variables are mixed. We have performed many experiments and have found that our proposed measure of association can be used successfully for this purpose. To illustrate the efficiency of the measure, we present the result using the real data of the labour-force participation study by Mroz (1987), who examined the decision of married women to participate in the labour force. The dependent variable is binary, according to whether or not the woman in question was working at the time of the survey. Other variables used to explain the decision include family income (excluding the wife’s earnings), woman’s education, prior job experience, woman’s age, and the number of children under 6 years of age or between 6 and 18 years of age in the household. The data set consists of 753 cases with the following labels:
infl: 1 if in the labour force, and 0 otherwise
faminc: total family income, dollars
wage: estimated wage from earnings, dollars/hour
hours: hours worked
educ: years of schooling, total number
expr: years of job experience
age: woman’s age
kidslt6: the number of kids under 6 years old in the household
kidsgt6: the number of kids between 6-18 years old in the household
Table 4
Computed values of $\delta$ for the 7 independent variables of labour force data

Variable name   δ                Test applied
nwifeinc        0.999322675343   Kruskal-Wallis
educ            0.999998954841   Kruskal-Wallis
expr            1.000000000000   Kruskal-Wallis
exprsq          1.000000000000   Kruskal-Wallis
age             0.964471299557   Kruskal-Wallis
kidslt6         0.999999881216   Pearson’s $X^2$
kidsgt6         0.236782916243   Pearson’s $X^2$
Table 5
SAS output for the logistic regression model for the labour force data

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Chi-Square   Pr>ChiSq
Intercept   1    0.4255     0.8604           0.2445       0.6210
nwifeinc    1    −0.0213    0.00842          6.4243       0.0113
educ        1    0.2212     0.0434           25.9228      <.0001
expr        1    0.2059     0.0321           41.2421      <.0001
exprsq      1    −0.00315   0.00102          9.6354       0.0019
age         1    −0.0880    0.0146           36.4844      <.0001
kidslt6     1    −1.4434    0.2036           50.2637      <.0001
kidsgt6     1    0.0601     0.0748           0.6460       0.4215
Many studies have been conducted to set up an appropriate model for the decision of married women to participate in the labour force. Here we consider a logistic regression model with infl as the dependent variable and nwifeinc, educ, expr, exprsq, age, kidslt6 and kidsgt6 as the independent variables, where
nwifeinc = (faminc − wage × hours)/1000, exprsq = expr × expr.
We computed $\delta$ between each of the 7 independent variables and the response variable. The result is given in Table 4. The first 5 values are based on the Kruskal-Wallis test and the last 2 values are based on Pearson’s $X^2$ test. It can be observed that $\delta$ for kidsgt6 is considerably smaller than those for the other variables. The result suggests that kidsgt6 should not be included in the model. This is comparable to the result of the SAS logistic regression modelling of the data given in Table 5. From the table, we can observe that the p-value of the $X^2$ test statistic for the nullity of the coefficient of the variable kidsgt6 is quite large (0.4215), and hence we conclude that the variable should be excluded from the model.
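The chi-square statistics and p-values in Table 5 can be reproduced approximately from the estimates and standard errors alone, assuming they are the usual Wald statistics with 1 degree of freedom. The helper name `wald_test` is hypothetical, and small discrepancies from the printed values are expected because the inputs are rounded.

```python
import math


def wald_test(estimate, std_error):
    """Wald chi-square (1 df) and its p-value for H0: coefficient = 0."""
    chi_square = (estimate / std_error) ** 2
    p_value = math.erfc(math.sqrt(chi_square / 2.0))  # upper tail of chi-square(1)
    return chi_square, p_value


# (estimate, standard error) pairs copied from Table 5.
table5 = {"nwifeinc": (-0.0213, 0.00842), "kidsgt6": (0.0601, 0.0748)}
for name, (est, se) in table5.items():
    chi_square, p_value = wald_test(est, se)
    print(f"{name}: chi-square = {chi_square:.3f}, p = {p_value:.4f}")
```

For kidsgt6 the recomputed p-value stays far above conventional significance levels, in line with the small MDI reported for that variable in Table 4.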
3.2. Example 2
The decision tree is a powerful and popular data mining tool for prediction and classification. One attraction of the decision tree compared with other data mining tools, such as the neural network, is that it represents rules which can be expressed in writing, so that humans can easily understand the reason for a decision. However, too many variables, which may be a common feature of practical data sets for data mining, often produce a huge tree, so that it is hard to understand the rules produced by the tree. Also, the types of the variables are generally complex. In other words, the target variable is usually of nominal type, and the input variables are continuous and discrete (nominal or ordinal). There has been no unified approach to measuring the relative importance of input variables of mixed types against the target variable. A widely used technique is first to discretize continuous variables into discrete ones, and then to apply some measure to order the discrete variables with respect to their degree of relationship to the target variable. As was noted in the previous section, discretization incurs a loss of the information inherent in the data. Hence, it is desirable to avoid discretization, if possible, at the initial stage of variable selection. A remedy for discretization is MDI. Since MDI can order the input variables by their relative importance against the target variable regardless of the types of the variables involved, we can apply MDI to select appropriate input variables for the decision tree at the initial stage. We may also apply MDI at the growing stage of the decision tree, but this may not be efficient because at some stage the sample size may be too small for MDI to be efficient.
For demonstration, we will use the German credit data, which can be found at http://www1.ics.uci.edu/~mlearn/MLSummary.html. The data set consists of 1000 observations on 21 variables.
These 21 variables include 3 continuous variables, amount, age and duration, and 18 nominal or ordinal discrete variables, checking, history, purpose, savings, employed, installp, marital, coapp, resident, property, other, housing, existcr, job, depends, telephon, foreign and good_bad. The purpose of this analysis is to predict the credit rating of the customers. The credit rating of the customers is modelled by SPSS AnswerTree Release 2.1.1. The data set is randomly divided into 2 sets, a training set and a test set, which consist of 635 and 365 cases, respectively. Using the training set, we first build a decision tree using the stopping rule governed by the minimum number of cases: 10 for the parent node and 5 for the child node. Then pruning was applied to the tree with the test data set. Fig. 3 shows the resulting tree. It has 13 terminal nodes. The performance of the tree evaluated by the test set is shown in Table 6.
Now we first compute the MDI’s of all input variables against the target variable to select appropriate explanatory variables before applying the data set to the decision tree. The MDI’s are shown below in descending order. Selecting the appropriate number of variables is a matter of subjective decision. However, the result suggests removing 6 variables (installp, telephon, existcr, job, resident, depends) from further analysis, since the gap between installp and the next higher variable (coapp) is big.
Fig. 3. A decision tree for German credit data with all input variables.
Variable   MDI
checking   1.0000000
duration   1.0000000
history    1.0000000
savings    0.9999997
property   0.9999714
housing    0.9998883
purpose    0.9998843
age        0.9996089
employed   0.9989545
other      0.9983707
amount     0.9940845
foreign    0.9841692
marital    0.9777620
coapp      0.9639440
installp   0.8599667
telephon   0.7211238
existcr    0.5548559
job        0.4034184
resident   0.1384479
depends    0.0000000
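The screening step can be expressed as a simple rule on the ordered MDI values. The paper describes the cut-off as a subjective decision, so the threshold `max_gap = 0.05` used below is purely a hypothetical choice that happens to reproduce the selection of 14 variables; the function name `screen` is likewise an assumption.

```python
# MDI values of the 20 input variables of the German credit data, as listed.
mdi = {
    "checking": 1.0000000, "duration": 1.0000000, "history": 1.0000000,
    "savings": 0.9999997, "property": 0.9999714, "housing": 0.9998883,
    "purpose": 0.9998843, "age": 0.9996089, "employed": 0.9989545,
    "other": 0.9983707, "amount": 0.9940845, "foreign": 0.9841692,
    "marital": 0.9777620, "coapp": 0.9639440, "installp": 0.8599667,
    "telephon": 0.7211238, "existcr": 0.5548559, "job": 0.4034184,
    "resident": 0.1384479, "depends": 0.0000000,
}


def screen(mdi_values, max_gap=0.05):
    """Keep variables in descending MDI order until the drop to the next
    variable exceeds max_gap (a hypothetical threshold, not from the paper)."""
    ordered = sorted(mdi_values.items(), key=lambda kv: kv[1], reverse=True)
    kept = [ordered[0][0]]
    for (_, previous), (name, current) in zip(ordered, ordered[1:]):
        if previous - current > max_gap:
            break
        kept.append(name)
    dropped = [name for name, _ in ordered if name not in kept]
    return kept, dropped


kept, dropped = screen(mdi)
print(len(kept), dropped)
```

Scanning from the top rather than looking for the single largest gap matters here, because several gaps among the low-ranked variables are larger than the one between coapp and installp.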
Table 6
Confusion matrix of the tree of size 13 for German credit data

                     Actual category
Predicted category   0      1      Total
0                    223    49     272
1                    40     53     93
Total                263    102    365

Risk estimate: 0.243836; SE of risk estimate: 0.0224755
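The risk estimate reported with the confusion matrix is the misclassification rate on the test set, and the reported standard error is numerically consistent with the binomial formula $\sqrt{r(1-r)/n}$. The sketch below recomputes both from the cell counts; the function name `risk_estimate` is hypothetical, and whether the software uses exactly this formula is an assumption.

```python
import math


def risk_estimate(confusion):
    """Misclassification rate and its binomial standard error.

    confusion[i][j] = number of cases predicted as class i whose actual
    class is j, so the diagonal holds the correctly classified cases.
    """
    n = sum(sum(row) for row in confusion)
    wrong = n - sum(confusion[i][i] for i in range(len(confusion)))
    risk = wrong / n
    return risk, math.sqrt(risk * (1.0 - risk) / n)


# Cell counts from Table 6 (rows: predicted 0, 1; columns: actual 0, 1).
risk, se = risk_estimate([[223, 49], [40, 53]])
print(round(risk, 6), round(se, 7))
```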
Fig. 4. A decision tree for German credit data using selected input variables.
The remaining 14 input variables are used to build a decision tree. Using the same training and test data sets, and the same rules for building and pruning the tree, we obtained the decision tree shown in Fig. 4. The number of terminal nodes of the tree is
Table 7
Confusion matrix of the tree of size 7 for German credit data

                     Actual category
Predicted category   0      1      Total
0                    229    56     285
1                    34     46     80
Total                263    102    365

Risk estimate: 0.246575; Standard error of risk estimate: 0.0225605
7, which is much smaller than the previous tree of size 13. Also, surprisingly enough, the performance of the new tree is comparable to that of the old tree. Table 7 shows that the new tree has 90 misclassified cases among 365 cases, while the tree with all input variables has 89 misclassified cases. Thus we prefer the tree of size 7 to the tree of size 13 on the basis of the minimum description length principle. Note also that the splitting rules near the root node are based on the variables with larger MDI’s.

4. Conclusion
In this paper, we proposed a measure of departure from independence (MDI) as a unified measure of association for complex data types. The measure is based on the p-value of an independence test. Since the p-value is test-dependent, the measure also depends on the test applied. We suggested three different tests depending upon the types of the variables involved. We then empirically showed that the MDI’s from these tests have the desirable properties needed for the measure. In other words, when there is more association between one pair of random variables than between another pair, the probability of the corresponding MDI being larger than the other MDI was empirically shown to approach 1. We applied MDI to select appropriate variables at the initial stage of logistic regression modelling and of decision tree building, using real data sets, and found the results quite effective. At the current stage, however, our proposed MDI works efficiently only for relatively large sample sizes, because the measures are derived from statistical tests. If we are able to develop MDI’s that work for smaller sample sizes, we can apply this method to the building and pruning stages of the decision tree.

References
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46.
Cramér, H., 1946. Mathematical Methods of Statistics. Princeton University Press, Princeton, New Jersey.
Eubank, R.L., Lariccia, V.N., Rosenstein, R.B., 1987. Test statistics derived as components of Pearson’s Phi-squared distance measure. J. Amer. Statist. Assoc. 82, 816–825.
Goodman, L.A., Kruskal, W.H., 1954. Measures of association for cross classifications. J. Amer. Statist. Assoc. 49, 732.
Kendall, M.G., 1938. A new measure of rank correlation. Biometrika 30, 81–93.
Mroz, T.A., 1987. The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions. Econometrica 55, 765–799.
Shannon, C.E., 1948. A mathematical theory of communication. Bell System Tech. J. 27, 379–423 and 623–656.
Silvey, S.D., 1964. On a measure of association. Ann. Math. Statist. 35, 1157–1166.
Yule, G.U., 1912. On the methods of measuring association between two attributes (with discussion). J. Roy. Statist. Soc. 75, 579–642.