A new explanatory index for evaluating the binary logistic regression based on the sensitivity of the estimated model

Héctor M. Ramos*, Jorge Ollero, Alfonso Suárez-Llorens
Departamento de Estadística e Investigación Operativa, Facultad de CC. Económicas, av. Duque de Nájera 8, CP 11002, Cádiz, Spain
* Corresponding author. E-mail addresses: [email protected] (H.M. Ramos), [email protected] (J. Ollero), [email protected] (A. Suárez-Llorens).
Article history: Received 17 May 2016; Received in revised form 22 August 2016; Accepted 30 August 2016.
Keywords: Binary logistic regression; McFadden index; ROC curve; Sensitivity index

Abstract. We propose a new explanatory index for evaluating the binary logistic regression model based on the sensitivity of the estimated model. To this end, we first formalize the idea of sensitivity and establish the principles a statistic should comply with to be considered a sensitivity index. We apply the results to a practical example and compare the results with those obtained using other indices.

© 2016 Published by Elsevier B.V. http://dx.doi.org/10.1016/j.spl.2016.08.022
1. Introduction
Binary logistic regression is a frequently applied procedure for predicting the probability of occurrence of a binary outcome using one or more continuous or categorical variables as predictors. The logistic model relates the probability of occurrence of the outcome, denoted by Y, to the predictor variables X_i, with the occurrence of an event conventionally coded as one and nonoccurrence as zero. The model takes the form

P(Y = 1) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)]}.
The regression parameters are typically obtained by maximum likelihood estimation. Hosmer and Lemeshow (2000) provide a detailed discussion of the goodness of fit of the logistic regression model, particularly the well-known and commonly used Hosmer–Lemeshow goodness-of-fit test. When the predicted probabilities resulting from a logistic regression are used for classification purposes, additional indices of model fit are needed. Known as pseudo-R² indices, these play a role similar to that of R² in ordinary least squares (OLS) regression. Some indices, such as those formulated by Cragg and Uhler (1970), McFadden (1974), Maddala (1983), Cox and Snell (1989), and Nagelkerke (1991), compare the likelihood functions of an intercept-only model and the full model. In particular, the McFadden pseudo-R² is defined as follows:

R^2_{MF} = 1 - \frac{\log L(\text{full})}{\log L(\text{null})}.
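For concreteness, the following minimal Python sketch fits a binary logistic model and computes the McFadden pseudo-R² from the fitted and null log-likelihoods. It assumes the statsmodels package and synthetic data; all variable names are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # two synthetic predictors
p = 1 / (1 + np.exp(-(0.5 + X @ [1.0, -2.0])))   # true logistic probabilities
y = rng.binomial(1, p)                           # binary outcome

res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
p_hat = res.predict(sm.add_constant(X))          # fitted P(Y = 1)

# McFadden pseudo-R^2: 1 - log L(full) / log L(null)
r2_mf = 1 - res.llf / res.llnull
print(r2_mf, res.prsquared)                      # statsmodels reports the same value
```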
McKelvey and Zavoina (1975) propose a pseudo-R² based on a latent model structure, in which the binary outcome results from discretizing a continuous latent variable related to the predictors through a linear model. This pseudo-R² is then the proportion of the variance of the latent variable explained by the covariates. Cameron and Windmeijer (1997) define yet
another pseudo-R² index as the proportionate reduction in uncertainty, as measured by the Kullback–Leibler divergence, due to the inclusion of the regressors. Windmeijer (1995) and Smith and McKenna (2013) provide broad and detailed studies of the different pseudo-R² indices available for binary choice models. Theoretical results on the convergence and asymptotic normality of pseudo-R² indices are available in Hu et al. (2006). It is worth mentioning that no statistic exactly equivalent to the classical R² coefficient of OLS regression exists when analyzing data with a logistic regression: the estimates are arrived at through an iterative process and are not computed to minimize variance, so the OLS approach to goodness of fit does not apply. All the previous indices are called pseudo-R² because they resemble the classical R² in the sense that they lie on a similar scale, ranging from 0 to 1, with higher values indicating better model fit.

As an alternative to pseudo-R² indices, there exist other explanatory methods for evaluating a logistic regression model. In accordance with the usual interpretation of R² for linear models, they try to capture the model's ability to predict a single observation. Mainly, these methods take into account the differences between observed and predicted outcomes, a model being considered good when it has high explanatory power, i.e., when good prediction of an observation is possible because the estimated success probability is close to 1 or 0. Among others, we first highlight the coefficient of discrimination of Tjur (2009), which has considerable intuitive appeal and a very simple definition: for each of the two categories of the dependent variable, the average of the estimated probabilities is computed, and the difference between the two averages is taken. Its interpretation is based on the histograms of the empirical distributions of the fitted values for the failures and the fitted values for the successes; intuitively, the greater the difference, the better the model. Secondly, it is also worth mentioning some indices based on the concepts of concordance and discordance. Basically, concordance expresses, in percentage terms, the association between the actual values and the values fitted by the model: we count the pairs in which the observation with outcome one received a higher model score than the observation with outcome zero, and the opposite for discordance. Some examples are the classical, well-known Kendall's tau, the Goodman–Kruskal gamma and Somers' D (Somers, 1962). Finally, another valuable contribution is the receiver operating characteristic (ROC) curve, which represents the ''true positive'' and ''false positive'' classification rates as a function of different classification cutoff values for the predicted probabilities resulting from the logistic regression. In the literature, several indices of accuracy have been proposed to summarize ROC curves. In particular, the area under the curve (AUC) index is one of the most commonly used; see, for instance, Hosmer and Lemeshow (2000), Metz (1978) and Fawcett (2006) for detailed explanations of the basic principles of ROC analysis. The AUC index is related to Somers' D by the relationship D_YX = 2AUC − 1 (see Newson, 2002).
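As a concrete illustration of two of these alternatives, the following Python sketch computes Tjur's coefficient of discrimination and the AUC, together with Somers' D via the relationship above. The outcomes and fitted probabilities are toy numbers, and roc_auc_score is scikit-learn's implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy observed outcomes and fitted probabilities, purely illustrative
y = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])
p_hat = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.5, 0.85, 0.2, 0.95])

# Tjur's (2009) coefficient of discrimination: mean fitted probability of the
# observed successes minus that of the observed failures
tjur_d = p_hat[y == 1].mean() - p_hat[y == 0].mean()

auc = roc_auc_score(y, p_hat)    # area under the ROC curve
somers_d = 2 * auc - 1           # D_YX = 2*AUC - 1 (Newson, 2002)
print(tjur_d, auc, somers_d)
```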
In this paper, we propose a new explanatory index to measure the predictive power of a logistic regression model. From a theoretical point of view, it is not a proper pseudo-R² index; like the other explanatory methods mentioned above, it is based on the differences between observed and predicted outcomes. The new index is based on the sensitivity of the estimated binary logistic regression model, where the term sensitivity refers to the ability of the model to predict correctly the value of the dependent variable. Most statistical software packages provide, as a self-evaluation of the estimated model, the number of individuals in the sample that the model predicts correctly as a function of the critical values considered (the cutoff points). In other words, each cutoff point cp provides the percentage of sampled individuals observed with value one that the estimated model predicts correctly by assigning P[Y = 1] > cp. These values, which decrease as cp increases, are the components of a vector we refer to as S1, whose dimension equals the number n of cutoff points cp_1, cp_2, ..., cp_n considered. Associated with S1 we have a vector X1, of dimension n + 1, whose (i + 1)th component is the number of sampled individuals observed with value one that the model predicts accurately for cutoff point cp_i but inaccurately for cp_{i+1} (i : 1, ..., n − 1); equivalently, the components of X1 are the numbers of observed ones whose estimated probabilities fall in each of the n + 1 intervals determined by the cutoff points. Likewise, the self-evaluation of the model provides the percentage of individuals in the sample observed with value zero that the estimated model predicts correctly by assigning P[Y = 1] < cp. These values, which decrease as cp decreases, are the components of a vector we refer to as S0; in this case, we consider the components of S0 in decreasing order, that is, ordered by decreasing cutoff point. Associated with S0, we have a vector X0 whose components are the numbers of individuals in the sample observed with value zero that the model predicts accurately for cutoff point cp_i but inaccurately for cp_{i−1}.

To illustrate, consider a model estimated using a sample of 20 individuals observed with value one and 10 individuals observed with value zero, and suppose we select the deciles c_0.1, c_0.2, ..., c_0.9 as cutoff points. Let us assume that S1 and S0 are

S1 = (1, 1, 0.95, 0.9, 0.8, 0.8, 0.7, 0.55, 0.35);
S0 = (1, 1, 1, 0.9, 0.9, 0.9, 0.7, 0.6, 0.3).

The corresponding X1 and X0 vectors are then

X1 = (0, 0, 1, 1, 2, 0, 2, 3, 4, 7);
X0 = (0, 0, 0, 1, 0, 0, 2, 1, 3, 3).
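To make the construction concrete, the following Python sketch builds S1, S0, X1 and X0 from observed outcomes and fitted probabilities, assuming decile cutoff points; p_hat is a random placeholder here, and all names are illustrative.

```python
import numpy as np

y = np.array([1] * 20 + [0] * 10)                   # 20 ones and 10 zeros, as above
p_hat = np.random.default_rng(1).uniform(size=30)   # placeholder fitted P(Y = 1)
cutoffs = np.arange(0.1, 1.0, 0.1)                  # deciles c_0.1, ..., c_0.9

# S1: share of observed ones predicted correctly (P[Y = 1] > cp) at each cutoff
S1 = np.array([(p_hat[y == 1] > cp).mean() for cp in cutoffs])
# S0: share of observed zeros predicted correctly (P[Y = 1] < cp), with the
# cutoffs taken in decreasing order so that the components decrease
S0 = np.array([(p_hat[y == 0] < cp).mean() for cp in cutoffs[::-1]])

# X1, X0: counts of ones (zeros) in the intervals determined by the cutoffs;
# X0 is reversed to match the decreasing cutoff order used for S0
bins = np.r_[0, cutoffs, 1]
X1 = np.histogram(p_hat[y == 1], bins=bins)[0]
X0 = np.histogram(p_hat[y == 0], bins=bins)[0][::-1]
```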
We may interpret the components of S1 and S0, within a ROC curve context, in terms of sensitivity and specificity, respectively. However, even though we have this common starting point derived from the ROC curve, we show later that there are substantial methodological differences.

We intend to introduce the idea of sensitivity as applied to any vector X(x_i) ∈ R^n_+. In the same way that there is a certain intuitive idea that the components of a vector X are ''more nearly equal'' than the components of another vector Y, we can speak of a certain intuitive idea that the components of a vector X present more sensitivity than the components of another vector Y. Focus again on vector X1 = (0, 0, 1, 1, 2, 0, 2, 3, 4, 7). We know that we can measure the inequality of the components of X1 via certain inequality measures, such as the variance, which considers the underlying dispersion, or the Gini index, where the vector components correspond to an income distribution. These measures, and any others corresponding
to the basic and intuitive idea of inequality must follow Dalton's (1920) transfer principle (also known as the Pigou–Dalton principle), which states: given a vector X, if an amount ∆ is transferred from a component x_i to a component x_j, provided that 0 < ∆ ≤ (x_i − x_j), inequality decreases.

Let us focus again on vector X1 within a binary logistic regression context. The components of X1 show the greater or lesser ability (sensitivity) of the model to predict accurately the individuals in the sample observed with value one. In this case, the basic and intuitive idea of inequality crystallizes in the sensitivity concept. One of the objectives of this paper is therefore to delimit the sensitivity concept by setting reasonable principles that any statistic must meet to serve as a sensitivity measure. First, such a statistic should be invariant under scale transformations: it is obvious that if we multiplied vector X1 by any constant k > 0, the sensitivity vector S1 would remain the same. On the other hand, it is easy to see that the vector X1' = (0, 0, 1, 1, 2, 5, 2, 3, 4, 2), obtained by transferring within X1 five units from the tenth component to the sixth component, may be associated with a model less sensitive to the detection of values of one than the model corresponding to X1. In this case, for cutoff point c_0.6, the model would predict accurately 80% of the values of one in the case of vector X1, whereas it would predict accurately only 55% in the case of vector X1'. In fact, whatever cutoff value we set as the critical value, vector X1 has sensitivity greater than or equal to that of X1'.

In this example, we have transferred from a component of greater value to a component of lesser value. However, in contrast to the compatibility of common inequality or variability measures with Dalton's transfer principle, for sensitivity we must impose a different condition on the transfer of an amount from one component to another. This is because, even though any permutation of the components of a vector leaves its inequality invariant, the same is not true of its sensitivity. For instance, let us assume that in X1 we transfer one unit from the fifth component to the sixth component. The resulting vector X1'' = (0, 0, 1, 1, 1, 1, 2, 3, 4, 7) does not have less sensitivity; in fact, its sensitivity is greater. Indeed, for cutoff point c_0.5, the model corresponding to X1 predicts accurately 80% of the values of one, while the model corresponding to X1'' would predict 85%. Whatever cutoff point we take as the critical value, vector X1'' has sensitivity greater than or equal to that of X1. This is because, for sensitivity, we must consider not only the values of the components but also the positions of the components involved in the transfer. This prompts us to propose a modification of the classic transfer principle, and this approach serves as the starting point of our contribution.

Our first objective in this paper is to formalize the idea of the sensitivity associated with a vector X(x_i) ∈ R^n_+. To achieve this, we establish the principles that any statistic should meet to be a sensitivity index. Once this general framework is established, we propose a new explanatory index to assess the goodness of fit of a binary logistic regression model. The new index draws on the sensitivity of the predictive ability of the estimated model. Finally, we apply the results to several examples and compare them with those obtained using the classical McFadden pseudo-R² and AUC indices.
2. Sensitivity index
Definition 1. Given a vector X(x_i) ∈ R^n_+, the associated sensitivity vector S_X(s_j) is defined as

s_j = \frac{\sum_{i=j+1}^{n} x_i}{T} \quad (j : 1, \ldots, n-1), \quad \text{where } T = \sum_{i=1}^{n} x_i.
Definition 2. Let X(x_i) ∈ R^n_+, and let S_X(s_j) (j : 1, ..., n − 1) be the associated sensitivity vector. We define the sensitivity index R as:

R(X) = \frac{1}{n-1} \sum_{j=1}^{n-1} s_j. \qquad (1)
It is easy to prove that 0 ≤ R(X) ≤ 1, with the minimum value reached when X = (T, 0, 0, ..., 0) and the maximum value when X = (0, 0, ..., 0, T). In what follows, we establish two reasonable principles that any sensitivity index associated with a vector X(x_i) ∈ R^n_+ must follow.

2.1. Criteria for sensitivity indices

A sensitivity index must meet the following two principles.

(i) Revised transfer principle: given a vector X(x_i) ∈ R^n_+, when an amount ∆ is transferred from a component x_k to a component x_h, with k > h, provided that 0 < ∆ ≤ x_k, sensitivity decreases.

(ii) Scale transformation invariance principle: for every X(x_i) ∈ R^n_+ and for every k > 0, X and kX have the same associated sensitivity vector, and therefore their sensitivity must be equal.
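As a concrete illustration, the following Python sketch implements Definitions 1 and 2 (the function names are ours, not from the paper) and checks both principles numerically on the vector X1 from the introduction.

```python
import numpy as np

def sensitivity_vector(x):
    """Sensitivity vector of Definition 1: s_j = (x_{j+1} + ... + x_n) / T."""
    x = np.asarray(x, dtype=float)
    suffix_sums = np.cumsum(x[::-1])[::-1]   # entry j holds x_{j+1} + ... + x_n
    return suffix_sums[1:] / x.sum()         # drop the full sum (the j = 0 entry)

def R(x):
    """Sensitivity index of Definition 2, Eq. (1): mean of the sensitivity vector."""
    return sensitivity_vector(x).mean()

X1 = [0, 0, 1, 1, 2, 0, 2, 3, 4, 7]          # vector from the introduction
print(sensitivity_vector(X1))                # [1. 1. 0.95 0.9 0.8 0.8 0.7 0.55 0.35]
print(R(X1))                                 # 0.7833...

# (i) Revised transfer principle: moving 5 units from x_10 to x_6 lowers R
assert R([0, 0, 1, 1, 2, 5, 2, 3, 4, 2]) < R(X1)
# (ii) Scale transformation invariance: R(kX) = R(X) for every k > 0
assert np.isclose(R(np.multiply(X1, 3.0)), R(X1))
```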
Result 1. The sensitivity index R(X) defined in (1) complies with the revised transfer principle.

Proof. It is easy to see that the sensitivity vector S_{X*} resulting from the transfer is dominated by the initial vector S_X in the elementwise vector ordering, since all components of S_X are greater than or equal to the corresponding components of S_{X*}. From this point, the proof follows easily.
Result 2. The sensitivity index R(X ) is invariant under a scale transformation.
The proof is trivial and has been omitted.
2.2. The R* index for evaluating a binary logistic regression model
21
Most statistical software packages auto-evaluate the estimated model using the ratios of the sampled individuals that the model predicts accurately, depending on the critical values (cutoff points) established. Generally, cutoff points range from zero to one in constant increments. Let us assume, for example, that we take deciles c0.1 , c0.2 , . . . , c0.9 . In this case, the software will provide two vectors of dimension 9 that we can identify immediately with sensitivity vectors S1 and S0 , respectively, associated with vectors X1 and X0 , of which the components are the observed absolute frequencies of the values of one and zero, respectively, in the sample that the model predicts accurately for each cutoff point considered between 0.1 and 0.9. Using the sensitivity index R defined in (1), we can measure sensitivity corresponding to vectors X1 and X0 . The measurements are denoted R1 and R0 , respectively. We consequently see that we can measure the sensitivity of the estimated model to predict the values of one correctly and the sensitivity to predict values of zero correctly. The greatest model sensitivity is when the components of the two sensitivity vectors are equal to one. This occurs when the estimated model assigned P [Y = 1] > 0.9 to all ones observed in the sample and P [Y = 1] < 0.1 to all zeros observed in the sample. Based on this, we propose an intuitive index based on the sensitivity of both vectors, X1 and X0 , denoted by R∗ , given by R∗ = R1 + R0 − 1.
(2)
30
It is evident that the greater the sensitivity of the vectors X1 and X0, the greater the value of R*, corresponding to better explanatory power of the estimated model. Of course, our index depends on the number of cutoff points: it will take different values if we consider, for instance, percentiles or quartiles instead of deciles. This is a drawback shared with other classical indices, such as the Hosmer–Lemeshow index based on the chi-squared test. However, because we average empirical accumulated frequencies, the index becomes more precise as the number of cutoff points increases, which in turn depends on the sample size. Hence, the optimal number of cutoff points is in concordance with the optimal number of non-overlapping, equal-sized intervals used to construct a histogram. Finally, it is apparent that if we are interested in comparing different models, we must consider identical cutoff points in all cases.
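The following sketch shows how R* could be computed directly from observed outcomes y and fitted probabilities p_hat (names illustrative), assuming decile cutoff points.

```python
import numpy as np

def r_star(y, p_hat, cutoffs=np.arange(0.1, 1.0, 0.1)):
    """R* of Eq. (2), a sketch: R1 and R0 are the means of the sensitivity
    vectors S1 and S0 of Eq. (1), built from the fitted probabilities."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    s1 = np.array([(p_hat[y == 1] > cp).mean() for cp in cutoffs])  # ones correct
    s0 = np.array([(p_hat[y == 0] < cp).mean() for cp in cutoffs])  # zeros correct
    return s1.mean() + s0.mean() - 1
```

As noted above, values of r_star are only comparable across models when computed with identical cutoff points.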
Result 3. 0 ≤ R* ≤ 1.
Proof. Since R1 ≤ 1 and R0 ≤ 1, we have R* ≤ 1. We now prove that R* ≥ 0. As seen in the introduction, we may interpret the components of S1(s^1_j) and S0(s^0_j), within a ROC curve context, in terms of sensitivity and specificity, respectively. Then s^1_j + s^0_j ≥ 1 for each cutoff point considered. From this,

R^*(X) = R_1 + R_0 - 1 = \sum_{j=1}^{n-1} \frac{s_j^1}{n-1} + \sum_{j=1}^{n-1} \frac{s_j^0}{n-1} - 1 = \frac{1}{n-1} \sum_{j=1}^{n-1} (s_j^1 + s_j^0) - 1 \geq 0.
In conclusion, R* = R1 + R0 − 1 is based on the sensitivity associated with the vectors whose components are the observed absolute frequencies of the values of one and zero in the sample that the model predicts accurately for each cutoff point, with sensitivity vectors S1 and S0, respectively. Namely, R_i is the average of the sensitivity vector S_i, i = 0, 1. Alternatively, R_i can be interpreted as the area under the step curve associated with S_i, i = 0, 1; hence R* is the area between the step curves associated with S1 and 1 − S0. The equality R* = 1 holds when all components of both sensitivity vectors S1 and S0 are equal to 1; in other words, when all estimated probabilities associated with the fitted explanatory variables for the successes and failures in the sample are, respectively, greater than the last and smaller than the first cutoff point considered, which can be considered the ideal case. On the other hand, R* = 0 holds when the area between the step curves associated with S1 and 1 − S0 equals 0, that is, when R1 = R0 = 1/2, which is equivalent to the successes and failures being uniformly distributed across all the intervals defined by the cutoff points. In such a case, S1 and S0 take the form

S_1 = S_0 = \left( \frac{n-1}{n}, \frac{n-2}{n}, \ldots, \frac{1}{n} \right),

which clearly corresponds to the worst explanatory power.
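A quick numerical check of this worst case (a sketch; the choice n = 10 is arbitrary):

```python
import numpy as np

n = 10
s = (n - np.arange(1, n)) / n    # S1 = S0 = ((n-1)/n, (n-2)/n, ..., 1/n)
r1 = r0 = s.mean()               # each average equals 1/2
print(r1, r1 + r0 - 1)           # 0.5 0.0  ->  R* = 0 in the uniform case
```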
3. Examples

Table 1. Values of V1, V2 and V3 for OECD countries, 2010–2011. [Table body not recovered.]
Table 1 provides the values of three variables measured during the period 2010–2011 for OECD countries (source: OECD, http://stats.oecd.org/). Using the data in Table 1, we consider a logistic regression model. The dependent variable is a binary variable that takes the value one for European countries (24 countries in the sample) and zero for non-European countries (10 countries in the sample). The predictors are V1 (ratio of women in employment, 15 to 64 years), V2 (old-age social spending, as a percentage of gross domestic product) and V3 (CO2 emissions, in millions of tons). We employ Statgraphics Centurion XVII as the statistical software package. The software provides the percentages of one and zero values in the sample that the model predicts accurately. We establish the cutoff points 0.1, 0.2, ..., 0.9. From the software output, we obtain the corresponding sensitivity vectors S1 and S0, and from these we calculate the R* index defined in (2). Finally, we compare the R* value against the McFadden pseudo-R². The corresponding sensitivity vectors are:

S1 = (1, 1, 1, 1, 0.9583, 0.9583, 0.875, 0.875, 0.7917);
S0 = (0.9, 0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.6).

The sensitivity indices defined in (1) are R1 = 0.9398 and R0 = 0.8333. Then

R* = R1 + R0 − 1 = 0.7731.

We obtain a value of R* slightly higher than the McFadden index R²_MF provided by the software. For this example and for 19 other cases, we have calculated the values of the R*, McFadden and AUC indices; most of the cases are from http://www.umass.edu/statdata/statdata/stat-logistic.html. Table 3 lists the results. The examples discussed show that the values obtained for the new R* index are very close to the McFadden index values. The correlation matrix (Table 4), obtained from the values in Table 3, indicates very large linear correlation coefficients between the R* index and the other two indices, especially in the case of the McFadden pseudo-R².
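These figures can be reproduced directly from the reported sensitivity vectors with a few lines of Python (a sketch; by Eq. (1), each R_i is just the mean of S_i):

```python
import numpy as np

S1 = np.array([1, 1, 1, 1, 0.9583, 0.9583, 0.875, 0.875, 0.7917])
S0 = np.array([0.9, 0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.6])
R1, R0 = S1.mean(), S0.mean()          # averages of the sensitivity vectors
print(round(R1, 4), round(R0, 4))      # 0.9398 0.8333
print(round(R1 + R0 - 1, 4))           # 0.7731, the R* value reported above
```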
Table 3
Values of the R*, McFadden and AUC indices for the 20 examples considered.

R*       McFadden   AUC      |   R*       McFadden   AUC
0.5194   0.4932     0.9125   |   0.5073   0.4030     0.8873
0.7731   0.7353     0.9375   |   0.7808   0.7416     0.9780
0.0611   0.0552     0.6333   |   0.2996   0.2416     0.8193
0.4923   0.4238     0.9007   |   0.0717   0.0859     0.7113
0.0962   0.0856     0.6840   |   0.3356   0.3485     0.8851
0.5463   0.4522     0.9219   |   0.3888   0.3583     0.8696
0.5185   0.4403     0.9219   |   0.2780   0.2145     0.7987
0.7555   0.7073     0.9750   |   0.2191   0.1748     0.7697
0.8426   0.7942     0.9833   |   0.4673   0.3888     0.8908
0.4512   0.3640     0.8729   |   0.1730   0.1213     0.7443
Table 4
Correlation matrix of the index values in Table 3.

            R*       McFadden   AUC
R*          1        0.9910     0.9459
McFadden    0.9910   1          0.9287
AUC         0.9459   0.9287     1
4. Conclusion and further research

In this work, we have introduced the idea of sensitivity associated with a vector X(x_i) ∈ R^n_+. We have formalized this intuitive idea by establishing the principles that a statistic must satisfy in order for us to consider it a sensitivity index: namely, the revised transfer principle and the scale transformation invariance principle. This formal framework allowed us to propose a new index for evaluating a binary logistic regression model. The new index has a very simple construction, allows for easy implementation in any statistical software package, and poses no computational problems. It broadly belongs to the class of indices based on the differences between observed and predicted outcomes; in this sense, it shares a starting point with the coefficient of discrimination proposed by Tjur (2009) and with indices derived from the classical ROC curve. It is also worth mentioning that the partial information given by R0 and R1 in the computation of the index is crucial for determining the effect of both false positives and false negatives: from a practical point of view, given two models with the same R*, the model with the higher R1 value should be preferred. Finally, an important question for future research is to study in detail the relationship of the new index with other classical indices, as well as its asymptotic properties under different choices of cutoff points.
Acknowledgments
The authors thank the referees and the Editor of the journal for their comments, which significantly improved the presentation of this paper. The authors also acknowledge support received from the Ministerio de Economía y Competitividad (Spain) under grant MTM2014-57559-P.
References
Cameron, A.C., Windmeijer, F.A.G., 1997. An R-squared measure of goodness of fit for some common nonlinear regression models. J. Econometrics 77, 329–342.
Cox, D.R., Snell, E.J., 1989. The Analysis of Binary Data, second ed. Chapman & Hall, London.
Cragg, J.G., Uhler, R.S., 1970. The demand for automobiles. Can. J. Econ. 3, 386–406.
Dalton, H., 1920. The measurement of the inequality of incomes. Econom. J. 30, 348–361.
Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874.
Hosmer Jr., D.W., Lemeshow, S., 2000. Applied Logistic Regression, second ed. John Wiley & Sons, New York.
Hu, B., Shao, J., Palta, M., 2006. Pseudo-R² in logistic regression model. Statist. Sinica 16, 847–860.
Maddala, G.S., 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press, Cambridge, UK.
McFadden, D., 1974. Conditional logit analysis of qualitative choice behavior. In: Zarembka, P. (Ed.), Frontiers in Econometrics. Academic Press, New York, pp. 104–142.
McKelvey, R.D., Zavoina, W., 1975. A statistical model for the analysis of ordinal level dependent variables. J. Math. Sociol. 4, 103–112.
Metz, C.E., 1978. Basic principles of ROC analysis. Semin. Nucl. Med. 8 (4), 283–298.
Nagelkerke, N., 1991. A note on a general definition of the coefficient of determination. Biometrika 78, 691–692.
Newson, R., 2002. Parameters behind ''non-parametric'' statistics: Kendall's tau, Somers' D and median differences. Stata J. 2 (1), 45–64.
Smith, T.J., McKenna, C.M., 2013. A comparison of logistic regression pseudo-R² indices. Mult. Linear Regres. Viewp. 39 (2), 17–26.
Somers, R.H., 1962. A new asymmetric measure of association for ordinal variables. Am. Sociol. Rev. 27, 799–811.
Tjur, T., 2009. Coefficients of determination in logistic regression models – A new proposal: the coefficient of discrimination. Amer. Statist. 63 (4), 366–372.
Windmeijer, F.A.G., 1995. Goodness-of-fit measures in binary choice models. Econometric Rev. 14, 101–116.