International Journal of Forecasting 28 (2012) 248–260
Estimating causal effects of credit decisions

Gerald Fahner ∗

FICO, 901 Marquette Avenue, Suite 3200, Minneapolis, MN 55402, USA
Keywords: Causality; Data mining; Decision making; Econometric models; Finance; Model selection; Nonparametric methods; Regression
Abstract

In principle, making credit decisions under uncertainty can be approached by estimating the potential future outcomes that will result from the various decision alternatives. In practice, estimation difficulties may arise as a result of selection bias and limited historic testing. We review some theoretical results and practical estimation tools from observational study design and causal modeling, and evaluate their relevance to credit decision problems. Building on these results and tools, we propose a novel approach for estimating potential outcomes for credit decisions with multiple alternatives, based on matching on multiple propensity scores. We demonstrate the approach and discuss results for risk-based pricing and credit line increase problems. Among the strengths of our approach are its transparency about data support for the estimates and its ability to incorporate prior knowledge in the extrapolative inference of treatment-response curves.

© 2011 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
1. Introduction

Making credit decisions poses a classical problem of decision making under uncertainty: treatments for individuals must be selected on the basis of estimates of potential future outcomes resulting from treatment alternatives. A growing number of organizations have developed or aspire to develop generalizations of traditional scoring models for delivering such estimates for business objectives such as response, revenue, profit, prepayment, attrition, default and loss. Such models have been referred to as profit scoring models, action-effect models, or, in a wider sense, decision models (Marshall & Oliver, 1995; Rosenberger & Nash, 2009, p. 138; Thomas, 2009, p. 204). The models are developed to predict individuals' potential future outcomes, based not only on characteristics describing individuals or accounts, as is the case with traditional scores, but also on the applicable treatments for decisions such as product offers, pricing, credit limits, authorizations and collection actions.
∗ Corresponding address: 4443 Stony Meadow Lane, Austin, TX 78731, USA. E-mail address: [email protected].
The main problem is to estimate likely future outcomes for individuals or groups as functions of alternative treatments, conditional on the individual or group being held fixed. This problem is shared across research areas ranging from the medical to the social sciences. A significant body of statistical and econometric work has been dedicated to characterizing the theoretical properties of the problem and demonstrating the conditions under which accurate estimates can be found. Based on these properties, powerful tools for tackling the estimation problems have been developed. One basic problem concerns the estimation of the average effect of a binary treatment alternative on a single outcome of interest. For example, does an alternative style of collection letter nudge a larger share of past-due customers to make their payments? The basic problem can be refined to estimating the treatment effects for multiple treatment alternatives, and conditional on groups or even individuals.

The plan of this paper is as follows. The next section discusses the potential outcomes framework, which was first applied to the study of causation by Rubin (Holland, 1986), the related practical estimation tools of propensity scoring and matching, and some extensions and their applicability to credit operations. Section 3 outlines our approach to estimating the effects of credit decisions conceptually. Sections 4 and 5 give more details and present the case study results.
The style of our presentation is somewhat analogous to that of Rubin and Waterman (2006), while our work clearly extends beyond that of their article.

2. Problem formulation

2.1. Rubin causal model

The simplest formulation concerns dichotomous treatments, referred to in the following as 'control' and 'treatment'. We are given units (individuals, accounts) i = 1, . . . , N. For each unit, we posit a pair of potential outcomes, Y_i^0 and Y_i^1, under control and treatment, respectively. Each unit receives either control or treatment, as indicated by the treatment indicator (S_i = 0 if control, S_i = 1 if treated). Y_i is the observed outcome, such that Y_i = Y_i^0 if S_i = 0, and Y_i = Y_i^1 if S_i = 1. Only one potential outcome is observed; the other, called the counterfactual, is unobserved. The units are characterized by their observed covariates X_i, which are variables which are not influenced by the treatment, and indeed are typically measured prior to the treatment. The unit-level treatment effect is defined as the difference between the potential outcomes, Y^1 − Y^0. This is an unobservable random variable, because we never observe Y^0 and Y^1 together for the same unit. Fig. 1 illustrates these definitions for a credit treatment.

Fig. 1. A customer receives a credit line increase. Later, we observe the customer's account revenue. In a counterfactual world, the customer gets no increase. Boxes marked by dashed lines are unobserved. The diagram is not intended to imply that the treatment is selected based on the measured covariates only, although we will make this assumption later.

Consider the problem of estimating the average (expected) treatment effect θ ≡ E[Y^1 − Y^0] = E[Y^1] − E[Y^0]. This is an expectation over an unobserved variable, and thus cannot be estimated directly. We could estimate E[Y | S = 1] − E[Y | S = 0], but this will not generally equal the average treatment effect, because the two treatment groups may not be comparable in their covariate distributions prior to the treatment, which can lead to a severe selection bias (Table 5 in Section 4 provides an example). However, it is possible to estimate the average treatment effect based on assumptions of unconfoundedness and common support.

• Unconfoundedness (sometimes called conditional independence or selection on observables) requires (Y^0, Y^1) ⊥ S | X; i.e., the potential outcomes and treatment are conditionally independent given X.

• Common support (sometimes called overlap) requires that 0 < p(X) ≡ Pr{S = 1 | X} < 1; i.e., at each value of X, there is a nonzero probability of receiving each treatment. p(X) is called the propensity score.

Unconfoundedness implies

E[Y^k | X = x] = E[Y^k | S = k, X = x] = E[Y | S = k, X = x],    k ∈ {0, 1}.

Define θ(x) ≡ E[Y^1 − Y^0 | X = x], the average treatment effect for the subpopulation at X = x. From unconfoundedness, θ(x) = E[Y | S = 1, X = x] − E[Y | S = 0, X = x]. This expression can be estimated for any value of x with a nonzero and below-one probability of receiving each treatment, which is granted by the common support condition. Given a small region in the covariate space around X = x with common support, we can, at least in principle, estimate the local average difference of the outcomes between the treated and very similar (matching) control units. As a consequence, θ = E_X[θ(X)] can be estimated. Conditioning on X = x removes the bias due to observable covariates.

In the above, common support was assumed globally. A weaker condition, which is still useful for estimating the effects of treatments, is for common support to only hold for a subset of the covariate space.

2.2. Matching on the propensity score

For practical applications, we need to obtain estimates from finite samples. It can be difficult, if not impossible, to find units with very similar values for all covariates. Among the useful results derived by Rosenbaum and Rubin (1983) is the finding that, if unconfoundedness holds, then (Y^0, Y^1) ⊥ S | p(X) also holds, i.e. the potential outcomes and treatment are conditionally independent, given the propensity score. Thus, biases due to observable covariates can be removed by conditioning on the propensity score alone:

E[Y^1 − Y^0 | p(X = x)] = E[Y | S = 1, p(X = x)] − E[Y | S = 0, p(X = x)].

This reduces the dimension of the estimation problem from the high-dimensional covariate space to a single dimension, the propensity score. Given a small interval [p, p + Δp] of propensity score values, we can determine the local average difference in outcomes between the treated and control units with propensity score values in that interval.

Matching is a method of sampling units to generate control and treatment groups which are comparable in terms of their covariate distributions. Matched sampling based on the propensity score alone ensures that the matched treated and controls have the same probability distributions for the observed covariates. This makes matched sampling a useful nonparametric tool for adjusting for treatment vs. control group differences in all observed covariates, thus removing any selection bias due to observables, before comparing the outcomes from treated and controls.

In practice, we seldom know the true propensity score; instead, we have to estimate it. Matching is then based on the estimated propensity score. The intuition
behind matching on the estimated propensity score is as follows. Divide the score into small intervals. The units within a given interval then have approximately the same estimated odds of receiving the treatment rather than the control. The common support assumption ensures that we find some treated and some control units in each interval (assuming a large enough sample size). We then match them based on sharing the same interval. The matched treated units should then have averages of their covariate values which are very similar to those of the matched control units. If this were not the case (for example, if a certain covariate assumed systematically higher values among the matched treated group), then this covariate would provide additional information about the treatment odds over and above the propensity score estimate. The estimated propensity score could therefore be improved by modeling the relationship with this covariate more carefully, until the covariate averages between the matched treated and controls eventually become balanced (not significantly different from each other). When developing the estimated propensity score, it is important to assess its balancing properties and improve the score if necessary. Balance is especially important for low-order moments, while balancing higher-order moments is generally hard and balancing the full joint distribution is generally infeasible. For a more extensive discussion of balancing, see Section 2.4.2 on the plausibility of common support.

2.3. Multiple treatments

The definition of treatment effects can be extended to multiple treatments (Imbens, 2000; Lechner, 2001; Rubin, 1997). Units receive treatments T_i ∈ {1, . . . , K} from a discrete set, and we observe the outcome Y_i^{T_i} associated with the treatment received. We do not observe the counterfactual outcomes Y_i^k, k ∈ {1, . . . , K} \ {T_i}. Analogous to our definitions for binary causal effects, the average and local treatment effects of treatment j relative to treatment k are defined as:

θ_jk ≡ E[Y^j − Y^k],    θ_jk(x) ≡ E[Y^j − Y^k | X = x].    (1)
The propensity score is defined for dichotomous treatment vs. control alternatives. More than two treatments can be handled by analyzing a set of dichotomous problems. For each dichotomous problem, we can ignore the units which do not belong to the dichotomy. Consider a problem with three treatments. The associated (control, treatment) dichotomies are (1, 2) , (1, 3) , (2, 3). To analyze (2, 3), we remove all units from the sample which receive treatments other than 2 or 3—here, all units receiving treatment 1. By defining a binary control vs. treatment indicator S = 0 if the treatment is 2 and S = 1 if the treatment is 3, we revert back to the Rubin causal model with the conventional control vs. treatment framework. We only need to consider the unordered dichotomies, since switching the definitions of treatment and control only changes the sign of the treatment effect. Given K treatments, there are K (K − 1) /2 unordered dichotomies.
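To make this reduction concrete, the sketch below enumerates the K(K − 1)/2 unordered dichotomies and fits one propensity score model per dichotomy, ignoring units outside the dichotomy. The L2-penalized logistic regression and all DataFrame/column names are illustrative assumptions, not the scorecard models used later in the paper.

```python
from itertools import combinations
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_dichotomy_propensity_scores(df, covariate_cols, treatment_col="treatment"):
    """Fit one binary propensity score model per unordered treatment dichotomy (j, k), j < k.

    For each dichotomy, units receiving other treatments are ignored, and
    S = 1 indicates receiving arm k ('treatment'), S = 0 arm j ('control').
    Returns a dict mapping (j, k) -> fitted model.
    """
    models = {}
    treatments = sorted(df[treatment_col].unique())
    for j, k in combinations(treatments, 2):          # K(K-1)/2 unordered pairs
        sub = df[df[treatment_col].isin([j, k])]
        S = (sub[treatment_col] == k).astype(int)     # binary treatment indicator
        X = sub[covariate_cols]
        # L2-penalized logistic regression as a simple stand-in for the
        # penalized-likelihood scorecards described in the paper.
        model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
        model.fit(X, S)
        models[(j, k)] = model
    return models
```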
2.4. Plausibility of assumptions

The unconfoundedness and common support conditions hold if the treatments are assigned randomly and with nonzero probabilities depending only on the observed covariates. Since most business operations are not designed in this way, it can be difficult, if not impossible, to estimate the average treatment effect from business-as-usual data. However, it may still be possible to estimate useful treatment effects for improving business operations, based on thoughtful data collection, a careful relaxation of assumptions, and appropriate modeling and estimation strategies, as well as by accompanying the conclusions with appropriate warnings concerning their assumptions and limitations. If the assumptions appear strenuous, any business decisions based on the modeling results should be validated carefully by domain experts and tested on a limited scale before any major changes to an operation are made. We consider the plausibility of each assumption in turn.

2.4.1. Plausibility of unconfoundedness

For applications where the treatments are assigned in an automated way based on the observed covariates, the unconfoundedness assumption can be met, at least in principle, provided that there is a high level of integrity in the data collection. The requirement is to store all of the variables used to assign the treatment and make them available for analysis. Increasingly, financial institutions base their treatment decisions on rules: automated systems that depend entirely on the observed covariates, possibly up to a random term from randomized testing. The covariates can include multiple data sources such as credit bureau data and scores, application information and origination scores, account transaction data and behavioral and transaction scores, and collection data and their scores. Powerful databases and robust decision management architectures increasingly ensure that all of the variables used to select the treatments are available as covariates for model development. Where this is the case, the unconfoundedness condition can be met.

For other applications, especially when human operators affect treatment decisions, the unconfoundedness assumption is generally not verifiable. If confounding variables that are jointly associated with the treatment and the potential outcomes remain unobserved, then no estimation technique will be able to control for them, which can lead to biased effect estimates. In unclear situations, analysts should go to great lengths to collect as many potential confounders as possible, which should include as much as possible of the information which is considered relevant by the human expert when assigning treatments. A common reaction is to include in the covariate set any variable that one can get hold of which is not influenced by the treatment.1
The hope is that some of the many covariates will be sufficiently correlated with the potentially unobserved confounders that controlling for the observed covariates will constitute a sufficient control for the unobserved confounders as well.

1 There are situations where this is not the best advice, concerning variables which are guaranteed not to be associated with the potential outcomes, but are associated with the treatment. See the application in Section 5.

Unconfoundedness is certainly a problematic assumption when credit officers use their personal judgments to assign treatments. Their decisions could take into account "hidden variables" which contain some additional information about potential outcomes, over and above the information contained in the observed and stored covariates. Personal impressions about customers could be one such example, to the extent that impressions are predictive of future performance. A potentially serious problem arises when important covariates, on which treatment decisions are based, are not maintained properly in the database. In what follows, we assume that unconfoundedness holds. This is a reasonable assumption for automated, high-volume consumer lending.

2.4.2. Plausibility of common support

The common support condition is under threat for operations which assign treatments consistently as functions of the measured covariates and do not engage in randomized testing. Customer segmentations, rule-based decision systems, treatment selections based on scores, etc., can all cause the probabilities of receiving certain treatments to be exactly 0 or 1 across subsets of the covariate space. Consider a simple rule that assigns every customer whose income falls below a certain cutoff point to control, and everyone else to treatment. The propensity score (a function of income) is either 0 or 1; there is no common support. The estimation of causal effects may still be attempted, based on parametric modeling, and limited to a local area around the cutoff, where the researcher is confident in making a functional assumption for the extrapolation. Linear or low-order extrapolations are generally regarded as locally defensible, as long as the treatment effects are only estimated for "borderline cases" (cases which are close to the decision boundaries) between the two treatment regimes. Alternatively, we can think of the modeling of the propensity score as being done using a somewhat smooth function of income that switches more gradually between 0 and 1 over a local area around the cutoff, thus mapping out a local region of "approximate" common support.

While simple cutoff decision rules create obvious common support problems, businesses often employ far more complex decision rules, where the treatment becomes a high-order interaction function of several covariates. Such rules can create many convoluted ripples in the treatment assignment function, and thus many borderline cases. If we model the propensity score by a lower-order function of X when the true propensity score is a high-order function of X, then approximate common support regions can emerge once again. The cost of low-order propensity score approximations is that certain higher-order moments of X may not be well balanced between the treatment and control groups, even after controlling for the estimated propensity score. This could theoretically lead to biases, if the potential outcomes depend on such unbalanced higher-order functions of the covariates. This is less of a concern when it is plausible
to assume that the potential outcomes depend on X through a reasonably smooth, lower-order function. Then, balancing the lower-order moments is the most important task for obtaining bias-free treatment effect estimates, and balancing complex interactions is less important. It is generally not feasible to balance all of the covariates and their interactions simultaneously, and hence judgment must be used (Rubin, 2006, p. 462).2

The situation can be improved greatly for businesses that perform randomized testing. In practice, cost considerations often constrain experiments to certain regions. Another reason for this is that the experiments may be designed, not to estimate treatment effects, but rather to compare alternative decision rules, as in the so-called champion/challenger testing of credit decision strategies. Thus, experimental test regions can be just as irregular as the aforementioned assignment rules, rather than filling the space completely. As a consequence, the common support condition may not be met globally, but may only be met across certain subsets of the covariate space crossed with the treatment space that "see" experiments. In the following, we will not assume that common support holds globally.

2.5. Implications of the common support problem for estimation

Lechner (2000) discusses the common support problem for applied evaluation studies and its implications for the nonparametric estimation of the average treatment effects for populations of interest. For situations where the common support condition fails, he describes conditions under which nonparametric bounds can be derived for the desired effect estimate. The bounds depend on the configuration of the common support region, the outcome distribution, and information about observed outcomes outside of common support. Research studies that simply discard observations outside the common support region ignore this information, and are in a sense "moving the goalposts" (Crump, Hotz, Imbens, & Mitnik, 2006), as the trimmed sample may no longer be representative of the original population of research interest. Rather than trimming the sample informally, Crump et al. (2006) determine the optimal subsample for which the average treatment effect can be estimated most precisely, and apply the method to the evaluation of the average treatment effects for a job training program.

Our approach to the common support problem differs from the methods discussed by Crump et al. (2006) and Lechner (2000). We also note that our study goals and interpretation are different: unlike for evaluation studies, credit decision makers hardly ever specify a priori a population of interest for which they wish to estimate the average treatment effect. Instead, their aim is to improve
the operation, by identifying possible profit improvements at the unit level wherever such opportunities may exist.

2 In this author's experience, outcomes from credit decisions can often be modeled accurately by additive functions of the covariates, especially when large numbers of covariates are available for modeling and when some covariates already represent transformations of raw variables that model the interactions between raw variables.

Our estimation approach, as described in more detail below, has two phases. Phase I ignores the information about outcomes outside of common support and estimates unit-level treatment effects and expected outcomes under alternative treatments, but only for subsets of the covariate space with common support. This can create opportunities for improving on credit decisions, though they are limited to these subsets. Phase II uses the estimates generated by phase I, as well as information about outcomes outside of common support, to develop a global parametric model that is used to predict unit-level expected outcomes under alternative treatments outside of common support. Thus, the global predictions can identify additional opportunities for improving credit decisions.

The failure of common support poses a different type of problem for the parametric global model. While trimming the sample is avoided and information from outside of common support is used, one needs to be careful about the model's functional assumptions. These rely to some extent on business knowledge and judgment. Prior knowledge, where it exists, can and should be used to guide the identification of further improvement opportunities outside of common support. However, since judgment can be biased, any conclusions drawn from the global model should be tested carefully before any major changes to the operation are made.

3. Estimation process

We outline the process conceptually and defer the modeling details to the application section. We arrange the outcomes into a unit-by-treatment table (see Table 1), with cells for both the observed and counterfactual outcomes. We wish to have estimates for all cells, both for new units and for new treatments.

Table 1
Unit-by-treatment table for the data. Each unit's observed outcome is entered into the column for the received treatment. Dots indicate counterfactual outcomes.

Units   Treatment 1   Treatment 2   Treatment 3
1       .             .             Y_1^3
2       Y_2^1         .             .
3       .             .             Y_3^3
4       .             Y_4^2         .
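As a concrete illustration of this data structure (not taken from the paper, and with made-up outcome values), the unit-by-treatment table can be represented as a matrix with missing entries for the counterfactual cells:

```python
import numpy as np
import pandas as pd

# Rows are units, columns are treatments; NaN marks unobserved (counterfactual) cells.
table1 = pd.DataFrame(
    np.nan,
    index=[1, 2, 3, 4],                        # units
    columns=["treat_1", "treat_2", "treat_3"]  # treatments
)
table1.loc[1, "treat_3"] = 0.62   # hypothetical observed outcomes Y_i^k
table1.loc[2, "treat_1"] = 0.55
table1.loc[3, "treat_3"] = 0.71
table1.loc[4, "treat_2"] = 0.48
print(table1)
```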
For each treatment, a model Ŷ^k(X), k = 1, . . . , K, is developed for the regression of Y^k on X among the units that receive treatment k, that is, for E[Y^k | T = k, X]. The intuition for separate regressions by treatment group is that the groups may come from different populations that behave differently. The smoothed outcome estimates are given in Table 2.

Table 2
Unit-by-treatment table, with smoothed estimates for the observed outcomes.

Units   Treatment 1   Treatment 2   Treatment 3
1       .             .             Ŷ_1^3
2       Ŷ_2^1         .             .
3       .             .             Ŷ_3^3
4       .             Ŷ_4^2         .

In order to avoid extrapolation risk, we do not attempt to estimate the counterfactual outcomes using the above models. Instead, we approach the problem of inferring counterfactual outcomes by first identifying common support regions for dichotomous treatment comparisons. Section 4 discusses how common support can be determined. Where we find common support, we develop local models for estimating treatment effects across the common support. This mitigates the extrapolation risk. We will use Eq. (1) for counterfactual outcome inferences. Consider the dichotomy (j, k) and assume that it has some common support (if not, no counterfactuals will be inferred). Ignore units that receive other treatments. We estimate θ̂_jk(X_i) for units in the common support set only. Section 4 discusses how these estimates can be obtained. Assume that unit i belongs to the common support set. If the unit receives treatment k, we infer its counterfactual outcome under treatment j as Ŷ_i^j = Ŷ_i^k + θ̂_jk(X_i). Otherwise, if it receives treatment j, we infer its counterfactual outcome under treatment k as Ŷ_i^k = Ŷ_i^j − θ̂_jk(X_i).

Any given unit in the sample may be a member of none, one, or several common support sets between the treatment it receives and any counterfactual treatments. Accordingly, for any given unit, none, one, or several counterfactual outcomes may be estimated. We illustrate a few possibilities by way of example: assume that unit 1 is from a region in the covariate space where the only treatment observed is k = 3. There are no units receiving other treatments located near unit 1, i.e. the unit is not part of any common support region. As a consequence, no counterfactual outcomes are estimated for this unit. Unit 2 is from a region where both treatments k = 1, 2 are represented. We estimate the counterfactual outcome Ŷ_2^2. Unit 3 is from a region where all of the treatments, k = 1, 2, 3, are represented. We estimate both counterfactual outcomes for this unit. We fill in the counterfactual outcome estimates in Table 3.

Table 3
Unit-by-treatment table, augmented with counterfactual outcome estimates. Entries marked with an asterisk are inferred by means of local causal effect estimates.

Units   Treatment 1   Treatment 2   Treatment 3
1       .             .             Ŷ_1^3
2       Ŷ_2^1         Ŷ_2^2 *       .
3       Ŷ_3^1 *       Ŷ_3^2 *       Ŷ_3^3
4       .             Ŷ_4^2         Ŷ_4^3 *
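The phase I augmentation just described can be sketched as follows. This is a simplified illustration under the paper's notation; the helper functions `effect_fn` and `in_support_fn`, the column layout, and all names are assumptions rather than the authors' implementation.

```python
import pandas as pd

def augment_with_counterfactuals(smoothed, received, effect_fn, in_support_fn, X):
    """Phase I: fill counterfactual cells for units in common support.

    smoothed      : DataFrame (units x treatments) with smoothed observed outcomes,
                    NaN elsewhere (as in Table 2).
    received      : Series mapping unit -> received treatment label.
    effect_fn     : effect_fn(j, k, x) returns the local effect estimate theta_hat_jk(x).
    in_support_fn : in_support_fn(j, k, x) returns True if x lies in the (j, k)
                    common support region.
    X             : DataFrame of covariates indexed by unit.
    """
    augmented = smoothed.copy()
    treatments = list(smoothed.columns)
    for i in smoothed.index:
        k = received[i]
        y_k = smoothed.loc[i, k]
        for j in treatments:
            if j == k:
                continue
            if in_support_fn(j, k, X.loc[i]):
                # Y_hat_i^j = Y_hat_i^k + theta_hat_jk(X_i)
                augmented.loc[i, j] = y_k + effect_fn(j, k, X.loc[i])
    return augmented
```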
Finally, a global model Ŷ̂(k, X) is developed. This models the regression of the previously estimated outcomes Ŷ^k, including the counterfactual outcome estimates, on both X and k, that is, E[Ŷ^k | k, X]. The double-hat notation indicates second-tier estimates, obtained by fitting through previous estimates. We fill all of the cells with the global estimates (see Table 4). This model allows outcomes to be predicted for new units. For a unit with covariates X = x, the estimated outcomes for all treatments are given by Ŷ̂(k, x), k = 1, . . . , K. It is also possible to predict the outcomes for new treatments, subject to some assumptions. One such assumption concerns ordinal treatments. In this case we may be willing to assume a smooth relationship between the treatment and outcome, replace the treatment index with the treatment value, and fit a regression function of the form Ŷ̂(Treatment Value, X).

Table 4
Unit-by-treatment table, completed with counterfactual outcome estimates for all cells. Entries marked with a dagger represent extrapolations.

Units   Treatment 1   Treatment 2   Treatment 3
1       Ŷ̂_1^1 †       Ŷ̂_1^2 †       Ŷ̂_1^3
2       Ŷ̂_2^1         Ŷ̂_2^2         Ŷ̂_2^3 †
3       Ŷ̂_3^1         Ŷ̂_3^2         Ŷ̂_3^3
4       Ŷ̂_4^1 †       Ŷ̂_4^2         Ŷ̂_4^3
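A minimal sketch of the phase II fit: the first-tier estimates are stacked into long format and regressed on the covariates and the treatment value. The generic gradient-boosting regressor used here is only a stand-in for the penalized, spline-based global model described in the case studies; all names are illustrative.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def fit_global_model(augmented, X, treatment_values):
    """Phase II: regress first-tier estimates Y_hat_i^k on (X_i, treatment value)."""
    rows = []
    for k in augmented.columns:
        filled = augmented[k].dropna()                # observed + inferred cells only
        block = X.loc[filled.index].copy()
        block["treatment_value"] = treatment_values[k]
        block["y_hat"] = filled.values
        rows.append(block)
    long = pd.concat(rows, ignore_index=True)
    model = GradientBoostingRegressor()
    model.fit(long.drop(columns="y_hat"), long["y_hat"])
    return model
```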
In summary, we develop a global model in two phases. Phase I smoothes the data and augments them with counterfactual outcome estimates, to the extent that there is common support, relying on interpolation only. Phase II fits a global model to the pre-smoothed, augmented data. To the extent that we can estimate counterfactual cells, the global model selection and estimation should be less prone to extrapolation problems, relative to developing a model using the observed data. Pre-smoothing may also facilitate global model selection, because both the variance and the influence of outliers are reduced. The phased model development process indicates which cells represent extrapolations. A conservative user of the model may choose to ignore the predictions for extrapolated cells and base their future treatment decisions on the interpolated results only.

4. Risk-based pricing application

Some lenders apply risk-based pricing to personal installment loans. Interest rate offers are affected by the would-be borrower's risk-related characteristics, and perhaps also by related criteria such as the loan purpose or origination channel. Approved applicants can decide to either take up the loan at the offered interest rate or walk away. Risk-based pricing affects several business objectives, including volumes, revenues, profits and losses. Determining the best offers for a population of applicants is often done based on judgmental segmentation schemes, although an increasing number of businesses now approach this problem empirically. One such approach is to model the applicants' expected outcomes on these business dimensions, as functions of potential price treatments.
Table 5
Comparison of the covariates and outcomes in the observed sample across the first 3 treatment groups.

Treatment (k)             1       2       3
Application risk score    172     140     134
Applicant age             47      42      37
Channel: supermarket      30%     36%     52%
Channel: internet         16%     20%     1%
Channel: marketing        54%     44%     47%
Residence: owns           86%     68%     51%
Residence: rents          11%     29%     45%
Residence: other          3%      3%      4%
. . . other covariates    ···     ···     ···
Take rate (Ȳ)             68%     37%     44%
The treatment definition deserves attention. Some institutions employ tiered approaches to pricing, where a base interest rate is first calculated, possibly subject to loan characteristics and market conditions. This is followed by the decision of a risk-based interest premium, that is, how many percentage points of interest are added to the base rate. In this study, we are concerned with the effect of the interest rate premium (referred to as the "premium" from here on). We treat the base rate as a covariate which is known prior to the treatment decision. Here, we focus on estimating the willingness of applicants to take the loan at any given treatment value; that is, we seek to estimate individual price-response functions. Intuitively, we expect negative treatment effects, as higher interest rates could only dissuade an applicant from taking up the loan. We also anticipate that different individuals will have different price sensitivities. Credit constrained or very loyal individuals may be willing to borrow at higher prices, while unconstrained and credit savvy consumers may walk away if the price is not competitive.

4.1. Data

We have N = 50,000 loan applications for approved applicants, characterized by 50+ continuous and categorical covariates measured before the treatment decision, including: applicant credit bureau variables, risk scores, demographics, origination channel, loan details and base rate. The outcome variable Y ∈ {0, 1} indicates the would-be borrowers' refuse/accept responses. The treatment variable (Premium) assumes integer percentage values from 0% to 13%. Not all treatment values in this range are observed, and the distribution is skewed towards lower values. We limit our analysis to 5 treatment groups containing more than 2000 applicants each. We index the ordered treatment values by k = 1, 2, . . . , 5, such that Premium(k) < Premium(k + 1). Table 5 summarizes the covariate and outcome means and fractions for the first 3 treatment groups. As would be expected from risk-based pricing, higher treatment values are associated with lower risk scores (lower creditworthiness). The distributions of age, origination channel, residential status, and many other covariates also differ substantially across the treatment groups.
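For concreteness, the treatment-group preparation described above might be coded as follows; the threshold of 2000 applicants comes from the text, while the DataFrame and column names are hypothetical.

```python
import pandas as pd

def prepare_premium_treatments(df, premium_col="premium", min_group_size=2000):
    """Keep only premium values with sufficiently many applicants and index them
    in increasing order, so that Premium(k) < Premium(k + 1)."""
    counts = df[premium_col].value_counts()
    kept = sorted(p for p, n in counts.items() if n > min_group_size)
    df = df[df[premium_col].isin(kept)].copy()
    df["k"] = df[premium_col].map({p: i + 1 for i, p in enumerate(kept)})
    return df, kept
```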
The take rate is the highest for the lowest treatment value. The casual observer could interpret this as evidence that the applicant population is highly sensitive to price premiums, but this interpretation is premature and could be misleading. As was noted earlier, direct comparisons of the outcomes between treatment groups can be biased, due to group differences. The fact that the take rate actually increases with the treatment value for the (2, 3) treatment dichotomy makes this problem obvious.

4.2. Common support

Our 5 treatment groups generate 10 treatment dichotomies for which we develop propensity scores. We model the associated propensity scores as generalized additive models in the form of scorecards (Thomas, 2009). We include all measured covariates in the models. Categorical predictors are expanded into binary indicator variables for the categories, while numeric covariates are binned into 10–20 quantiles. This technique allows the nonlinear relationships between the numeric predictors and the score to be modeled. Our model specification results in many degrees of freedom, and thus we choose to estimate the model parameters by maximizing the penalized likelihood, instead of by maximum likelihood estimation. This technique mitigates the problem of over-fitting to the noise, and often results in more accurate regression estimates than maximum likelihood estimation (Harrell, 2001).

Fig. 2 shows the propensity score distributions for all treatment dichotomies. The plot provides a quick overview of the common support situation. None of the treatment dichotomies enjoys global common support. The common support situation is typically such that we find only controls in the lowest score ranges, and only treated in the highest score ranges. In the intermediate score ranges, we find both controls and treated. The highest treatment group in the rightmost plot column largely lacks common support (Fig. 2).

We determine common support as follows: define the lower inner trim value as the value at which the empirical cumulative propensity score distribution of the treatment group exceeds 1%, and the upper inner trim value as the value at which the empirical cumulative propensity score distribution of the control group reaches 99%. Units with propensity score values in the inner trim range are flagged as common support units.

4.3. Matched sampling results

Here we discuss matched sampling based on the propensity score and illustrate the benefits of matched sample analysis for treatment effect estimation. For all treatment dichotomies with some common support, we sample pairs of treated and control units matched on the propensity score, using a variant of a greedy pair-wise matching algorithm (Rubin, 2006, p. 212). We initialize a reservoir set with the common support units and an empty matched set. Assume that there are fewer treated units in the reservoir than controls; otherwise, reverse the roles of treated and controls. We visit the treated units in the reservoir in a random order. For each
treated, we look up its propensity score value and find the control in the reservoir that has the closest propensity score value. If the propensity score difference between the treated and the closest control is below a certain threshold, called a caliper, we add the treated and its closest control to the matched data set and remove both units from the reservoir. The caliper size controls the closeness and the number of matches; the closer the matches we seek, the fewer matches we find. Larger caliper sizes generate more matches, which reduces the variance of the matched sample-based estimates, but the control is weaker, which may increase the bias of these estimates.

Below, we show the matched sample results for the (2, 3) treatment dichotomy. Approximately 13,000 units received these two treatments. With a caliper size of 0.0001, we obtain approximately 3500 matched units.3 The matched treated and control pairs have very similar covariate means, not just for the covariates shown, but for all covariates. In contrast to the results from Table 5, remarkably, the take rate now decreases with the treatment value; the treatment effect has the expected sign after adjusting for all observed covariates. Recall that we have estimated not the global average treatment effect, but the average treatment effect in the matched sample. Still, the insights gained from the matched sample analysis improve our understanding of price sensitivity (Table 6).

Table 6
Summary statistics for the matched sample for the (2, 3) treatment dichotomy.

Treatment (k)             2       3
Application risk score    137     137
Applicant age             40      39
Channel: supermarket      50%     51%
Channel: internet         1%      1%
Channel: marketing        49%     48%
. . . other covariates    ···     ···
Take rate (Ȳ)             41%     38%

As was discussed earlier, differences in the higher order moments of the covariate distributions between the treatment groups could lead to biases when comparing the matched sample outcomes, if the potential outcomes depend on the covariates in a nonlinear fashion. We cannot ignore this possibility. It is therefore recommended that not only the covariate means, but also low-order moments of the covariate distributions, such as their variances, be compared. Grouped histogram plots can serve to compare the univariate distributions between treated and controls (Fig. 3). Diagnosing distribution differences for all covariates and all treatment dichotomies can reveal the need for the improvement of propensity scores and matching parameters. What constitutes a good propensity score can be judged by the closeness of the matched distributions for all
covariates, while a large number of matches is also desirable. In our experience, matching using propensity scores in the form of scorecards tends to outperform matching using ordinary logistic regression propensity scores, in terms of leading to more similar distributions, while the size of the matched samples tends to be smaller. This is understandable, because scorecards generally achieve a better separation between groups, due to their more flexible modeling of nonlinear relationships. Similar to the choice of the caliper size, the choice of a model for the propensity score affects the tradeoff between the bias and the variance. We recommend sensitivity analysis for gaining an appreciation of the possible impact of modeling decisions.

3 We tested various caliper sizes and made a judgmental selection based on good balancing properties and the size of the matched sample. The effect estimation results turn out not to be very sensitive to the choice of caliper size, within a reasonable range of caliper sizes.

Fig. 2. Conditional propensity score distributions for all treatment dichotomies. The scores are scaled in ln(Odds). The vertical axes span the score range [−13, 13], and scores outside this range are truncated. The plot legends indicate the treatment group (top) and control group (bottom). The bars represent probability masses for the quantiles of the truncated score distribution. The bars on the left and right are for the control and treatment group distributions, respectively.

Fig. 3. Comparing the distributions of the 'Requested limit' for the (1, 3) treatment dichotomy.
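The following sketch illustrates the inner-trim determination of common support (Section 4.2) and the greedy pair-wise caliper matching described in Section 4.3. It is a simplified illustration under stated assumptions (propensity scores held in a NumPy array, matching without replacement, caliper on the same scale as the scores), not the exact algorithm used in the study.

```python
import numpy as np

def common_support_flags(score, treated, lower_q=0.01, upper_q=0.99):
    """Flag units inside the inner trim range: above the treatment group's 1%
    quantile and below the control group's 99% quantile of the propensity score."""
    lo = np.quantile(score[treated == 1], lower_q)
    hi = np.quantile(score[treated == 0], upper_q)
    return (score >= lo) & (score <= hi)

def greedy_caliper_match(score, treated, caliper=0.0001, seed=0):
    """Greedy pair-wise matching on the propensity score with a caliper.

    Returns a list of (smaller_group_index, other_group_index) pairs. Visits the
    smaller group in random order and pairs each unit with the closest unit from
    the other group, provided the score difference is within the caliper.
    """
    rng = np.random.default_rng(seed)
    t_idx = list(np.flatnonzero(treated == 1))
    c_idx = list(np.flatnonzero(treated == 0))
    if len(t_idx) > len(c_idx):             # always iterate over the smaller group
        t_idx, c_idx = c_idx, t_idx
    rng.shuffle(t_idx)
    pairs = []
    for i in t_idx:
        if not c_idx:
            break
        diffs = np.abs(score[c_idx] - score[i])
        j = int(np.argmin(diffs))
        if diffs[j] <= caliper:
            pairs.append((i, c_idx[j]))
            del c_idx[j]                    # match without replacement
    return pairs
```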
4.4. Individual and group treatment effects

Refined estimates of the treatment effects can be obtained based on the matched samples. A straightforward approach that works for larger subpopulations of the applicant population is to divide the above matched sample analysis into subsets by subpopulations of interest, such as risk groups or age groups. We then compare the empirical take rates between the matched treated and control pairs by subpopulation. The results may reveal different price sensitivities between subpopulations. We recall once again that, strictly speaking, such results hold only for the matched sample, and may not be representative of the entire applicant population. However, after careful deliberation, one may regard the insights gained as being representative of the applicant population, at least directionally.

The principal problem with subset analysis is that it quickly runs out of data if we seek estimates for very specific subpopulations or individuals. Such requirements call for the relationships between the covariates, treatment, and outcome to be modeled. Given suitable data, regression modeling of the outcome as a function of the covariates and treatment can be a powerful discovery tool. Regression models can be structured flexibly to include interaction terms between the treatment variable and covariates. This allows the effect of the treatment on the outcome to depend on the values of the covariates. We stress the importance of suitable data: regression analysis can be unreliable for the purpose of estimating causal treatment effects from observational data (Gelman & Hill, 2007, p. 184; Rubin, 1997), due to hidden extrapolation problems and the sensitivity of the results to untestable assumptions about the form of the extrapolation. One approach that has been recommended is to perform regression analysis on matched samples (Rubin & Waterman,
2006). The properties of the matched sample mitigate extrapolation problems. Let M_0 and M_1 be the control and treated units in the matched sample, respectively. Define:

ρ_l(X) ≡ E[Y | X, units in M_l] = Pr{Take | X, units in M_l},    l ∈ {0, 1}.

The treatment effect at X = x is θ(x) = ρ_1(x) − ρ_0(x), which follows from unconfoundedness and common support. We estimate this expression by developing separate scorecards for ρ̂_0(x), based on the units in M_0, and ρ̂_1(x), based on the units in M_1, using penalized maximum likelihood estimation. Individual-level treatment effect estimates for matched or common support units are given by θ̂(X_i) = ρ̂_1(X_i) − ρ̂_0(X_i).

For diagnostic purposes and to gain further insights into price sensitivity, we aggregate the individual-level effect estimates in order to generate subpopulation-level estimates (means, prediction intervals) for a few subpopulations of interest. Fig. 4 illustrates the treatment effect estimates for subpopulations based on the age and income quintiles of the covariate distributions in the matched sample. The matched sample for the (1, 2) treatment dichotomy was used for this analysis, and the results are conditional on this matched sample. A higher income is associated with an increased price sensitivity. Even after controlling for income, the price sensitivity is still distributed very heterogeneously within the income segments. Age and price sensitivity do not appear to be strongly associated in this sample. More insights were gained for other subpopulations of interest; for example, higher loan amounts, higher risk scores, and internet or supermarket marketing channels are associated with a higher price sensitivity.

Following the process outlined in Section 3, we augment the data with inferred counterfactual outcomes for the common support units, based on our estimates for the individual-level causal effects.
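A compact sketch of this estimator: fit separate outcome regressions on the matched controls M_0 and matched treated M_1, then take the difference of the predicted take probabilities. The L2-penalized logistic regression below merely stands in for the penalized-likelihood scorecards used in the paper; all names are illustrative.

```python
from sklearn.linear_model import LogisticRegression

def matched_sample_effects(X0, y0, X1, y1, X_eval):
    """Estimate theta_hat(x) = rho_hat_1(x) - rho_hat_0(x) on matched samples.

    X0, y0 : covariates and take/refuse outcomes for the matched controls (M_0)
    X1, y1 : covariates and outcomes for the matched treated (M_1)
    X_eval : units at which individual-level effects are evaluated
    """
    rho0 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X0, y0)
    rho1 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X1, y1)
    return rho1.predict_proba(X_eval)[:, 1] - rho0.predict_proba(X_eval)[:, 1]
```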
Fig. 4. Estimated treatment effects (changes in take probability) when moving from treatment 1 to treatment 2, by income and age. The vertical bars indicate 95% prediction intervals.

4.5. Modeling the premium-response function

When developing a global model for the premium-response function, it is plausible to assume that the predictions Ŷ̂(k, X) should depend on the treatment value in the form of somewhat smooth, monotonically decreasing curves, due to the ordinal nature of the treatment variable. We can achieve this by developing a model of the form Ŷ̂(Premium, X), where the predictions depend on the premium by means of a cubic B-spline expansion. This avoids strong functional assumptions such as linearity or constant elasticity. It also has the technical advantage that monotonicity can be imposed through constraints on the B-spline expansion coefficients. Imposing monotonicity can reduce the variation and improve the extrapolation behavior, relative to unrestricted polynomial regression, which can lead to counterintuitive extrapolations. As we have demonstrated, price sensitivity is distributed heterogeneously across applicants. We accommodate this behavior by allowing for interaction terms between the premium and the covariates. The model is fitted using penalized maximum likelihood, subject to constraints on the B-spline expansion coefficients. For testing purposes, the model can be constrained to a linear premium-response relationship. We use all covariates as predictors in their binned representations, thus allowing the nonlinear relationships between the covariates and the predictions to be modeled. Interaction terms between the covariates and the premium are tested and added to the model by a domain expert in a forward-selecting manner, which allows additional judgment to enter the modeling process. Our results confirm the nonlinear premium-response relationship and the existence of significant premium-covariate interactions. The selection of interaction terms by the domain expert largely coincided with the selection of the terms that most improved the RMSE measure of fit quality; the most predictive interactions tended to make business sense.
To gain insight into the global model fit, we again consider subpopulations of interest. For each subpopulation, we estimate Ŷ̂(Premium, X_i), i ∈ subpopulation, over the interval Premium ∈ [0%, 15%] in steps of 1%, which includes both the treatment values found in the data and other treatment values inside and outside the observed treatment range. We define the premium-response function for a subpopulation as the average over the individual-level premium-response functions in that subpopulation. Denote the premium-response function for subpopulation s by Ŷ̂(Premium, s). To compare the behavior across different subpopulations, we plot incremental response curves of the form Δ(Premium, s) ≡ Ŷ̂(Premium, s) − Ŷ̂(Premium = 0, s) as functions of Premium for different values of s in a chart. The incremental response curves intersect at the origin, which facilitates a visual comparison.

Fig. 5 compares the subpopulations defined by the income and credit bureau risk score quintiles. The take rates decrease more rapidly for low values of the premium and more slowly for higher premiums. Higher risk scores (more creditworthy applicants) are associated with a higher price sensitivity. This is to be expected, as these groups attract good competitor offers. Similarly, higher incomes are associated with a higher price sensitivity. This is also to be expected, as these individuals are likely to be less credit constrained.

Fig. 5. Incremental response curves for subpopulations. The solid lines indicate quintiles defined by the income and risk score, which are enumerated from the lowest to the highest quintile. The dotted lines indicate the average across the applicant sample. The plotted curves appear in exactly the same order as in the legends (the flattest curves are for Risk 1 and Income 1, i.e., for the lowest risk score and income quintiles, and the steepest curves are for Risk 5 and Income 5, i.e., for the highest risk score and income quintiles).

5. Credit line increase application

Lenders periodically review credit bureau information and credit card spending and payment behavior in order to decide on potential credit line adjustments. Intuitively, the limits for risky, unprofitable accounts may be decreased, while those for accounts in good standing and with growth potential may be increased. A related question concerns the adjustment amount, if any. Many institutions employ automated account management systems that decide on line increases based on rules. The decision rule can be very complex and can depend on many variables, including customer loyalty data, credit bureau information, the scores for risk, revenue and attrition, spending patterns, payment behavior, closeness to the current credit limit, and trends in any of these variables. Many organizations also test rule alternatives, a practice also known as champion-challenger testing (Rosenberger & Nash, 2009, p. 91). Credit line changes affect various important business outcomes, including revenue, costs, spending behavior, attrition, default likelihood and losses. Here, we study how card balances are affected by line increases and estimate the balance-response function. Intuitively, we expect nonnegative treatment effects, as higher lines enable higher balances, and there is little reason to suspect that higher lines reduce balances. We also expect the responses to differ between individuals. Severely credit constrained individuals will move to the new limit; others may respond in a muted manner or may not change their card usage at all.

5.1. Data

We have N = 200,000 accounts which tend to revolve most of their balances from month to month, and 150+ continuous and categorical covariates measured before the treatment decision, characterizing the customers and their account behaviors. Many variables are derived from raw data fields, such as financial ratios, smoothed time series averages and rates of spending changes. The treatment variable (Increase) assumes discrete values from 0 to $4000 that typically come in multiples of $500. The covariate distributions differ substantially between the treatment groups. As an outcome variable, we use the card balance 12 months after the treatment decision.
5.2. Common support

We limit our analysis to 8 treatment groups, containing more than 4000 accounts each. We index the ordered treatment values by k = 1, 2, . . . , 8, such that Increase(k) < Increase(k + 1). This generates 28 treatment dichotomies for which we develop propensity scores. We model the associated propensity scores as generalized additive models in the form of scorecards. We include all covariates in the propensity scores, with one notable exception. The excluded variable is a random number that was generated
to conduct a randomized champion-challenger test, comparing alternative line increase rules. While this variable is associated with the treatment, we can safely ignore it for the purposes of matching and estimating treatment effects, because it is, by the construction of the randomized test, unrelated to the potential outcomes.

Fig. 6 shows the propensity score distributions for all treatment dichotomies. The results indicate that treatment group 1 (Increase = 0, first row) enjoys limited common support with the other treatment groups. Treatment group 2 (second row) lacks common support with most other groups. Common support tends to be much reduced or nonexistent for very dissimilar treatments, as one might expect. Some dichotomies enjoy larger common support, such as (control, treatment) = (6, 7), (7, 8). The random variable is particularly predictive for these dichotomies, but less predictive for other dichotomies. This constitutes evidence that the champion-challenger test was focused mostly on aspects of the decision rules that concern the assignment to higher treatment values.

Fig. 6. Conditional propensity score distributions for all treatment dichotomies. The scores are scaled in ln(Odds). The vertical axes span the score range [−13, 13], and any scores outside this range are truncated. The plot legends indicate the treatment group (top) and control group (bottom). The bars represent probability masses for the quantiles of the truncated score distribution. The bars on the left and right are for the control and treatment group distributions, respectively.

5.3. Matched sampling results

For all treatment dichotomies with some common support, we perform matching based on the propensity score. We obtain balanced covariate distributions between the matched treated and control pairs.

5.4. Balance-response function

Fig. 7 shows the incremental balance response curves for subpopulations which are distinguished by risk and
historic card usage. Generally, the balance increases almost linearly with the increase amount, while the fit is based on a cubic B-spline expansion and is not restricted to linearity. Customers with lower bankruptcy scores (who are more likely to experience bankruptcy) and accounts with balances which are close to the old limit respond most strongly. This is to be expected, as individuals with lower bankruptcy scores are more likely to be credit constrained. A large fraction of accounts with balances which are near the historic limits may also be constrained from borrowing on their card as much as they would like. When they are treated with a line increase, they make use of it to increase their balance. Customers with high bankruptcy scores and balances far below the historic limits show much less of a response.

Fig. 7. Incremental balance response curves for subpopulations. The solid lines indicate quintiles defined by the risk score and cushion (the unused amount between the historic limit and the historic balance), which are enumerated from the lowest to the highest quintile. The dotted lines indicate the average across the sample. The plotted curves appear in exactly the same order as in the legends (the steepest curves are for Risk 1 and Cushion 1, i.e., for the lowest risk score and cushion quintiles, and the flattest curves are for Risk 5 and Cushion 5, i.e., for the highest risk score and cushion quintiles). A close inspection reveals a slight concavity of the relationship.

6. Discussion

Making credit decisions for individuals or accounts can be approached on the basis of a mathematical decision model relating the potential treatments, covariates, outcomes and business objectives. A key task when modeling decisions empirically is estimating the potential outcomes under potential treatments. The Rubin causal model connects potential outcomes to causal treatment effects and states the conditions for estimating these effects. We reviewed two key conditions under which such estimates can be obtained, namely unconfoundedness and common support. The tools of propensity scoring and matched sampling fit into this framework and facilitate causal effect estimation.

Many financial institutions automate their origination and account management decisions based on rules involving only observed covariates. They store all variables so
that the unconfoundedness condition can be met. However, as a consequence of fixed decision rules, common support problems can emerge, because groups receiving different treatments are separated. Common support may thus be reduced to small local regions of "borderline cases" close to decision boundaries. A lack of common support presents a formidable challenge for model development. Stronger assumptions are required for generating causal effect estimates. We have argued that complex rules, where the treatment is a high-order function of many covariates, can effectively enlarge the common support regions if we approximate the propensity score by a low-order function of the covariates, and also that this enlargement will not bias the estimates, subject to reasonable assumptions.

The common support condition can be guaranteed globally by exhaustive testing. However, businesses are typically constrained in their ability to conduct tests, leading to data limitations for potential outcome estimation. A common practice in account management is champion-challenger testing, where accounts are randomly assigned to carefully crafted decision rule alternatives. The tests are
designed to compare the rule performance, not to estimate the treatment effects. The tests tend to generate randomized treatments across irregularly shaped subsets of the covariate and treatment space, resulting in irregularly shaped common support regions.

We have presented a new approach for estimating the effects of credit decisions, based on theories and tools from causal modeling. Our approach deals with multiple ordinal or categorical treatment values, and captures heterogeneous treatment effects. It addresses common data limitations in the development of empirical decision models. Essentially, the approach consists of (i) identifying irregularly shaped common support regions, (ii) extracting local information about treatment effects from the support regions, and (iii) consolidating the local estimates into a global model for treatment-response behavior. The global model is useful for interpolative or extrapolative predictions and can incorporate prior knowledge. We obtained plausible results for both an origination and an account management problem.

Our findings show a direction for improvement for businesses which are intent on developing accurate, empirically derived decision models. The current testing practices (or lack thereof) can be extended to address the common support problem more effectively than champion-challenger testing. Improvements to common support will improve the accuracy of the models, reduce the reliance on assumptions, and ultimately lead to more profitable decisions.

Some decision model components can be sensitive to external changes. For example, the introduction of a new competing product or marketing campaign, or price competition, can change an offer-response function. This calls for frequent updates of models. We see a potential for our approach to be automated, such that the models can be updated frequently. Ongoing challenges for modeling credit decisions include large treatment spaces and sequential treatments. The Rubin causal model, with its cross-sectional design, represents a useful abstraction of a sometimes more complex reality, and this abstraction is expected to be of ongoing guidance for further developments.

References

Crump, R. K., Hotz, V. J., Imbens, G. W., & Mitnik, O. A. (2006). Moving the goalposts: addressing limited overlap in estimation of average treatment effects by changing the estimand. IZA Discussion Papers 2347. Institute for the Study of Labor (IZA).
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York: Cambridge University Press.
Harrell, F. E. (2001). Regression modeling strategies. New York: Springer-Verlag.
Holland, P. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 83, 706–710.
Lechner, M. (2000). A note on the common support problem in applied evaluation studies. Discussion paper no. 2001-01. University of St. Gallen.
Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In M. Lechner (Ed.), ZEW economic studies: Vol. 13. Econometric evaluation of labor market policies (pp. 43–58).
Marshall, K. T., & Oliver, R. M. (1995). Decision making and forecasting. McGraw-Hill.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
Rosenberger, L., & Nash, J. (2009). The deciding factor. San Francisco: Jossey-Bass.
Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127(5), 757–763.
Rubin, D. B. (2006). Matched sampling for causal effects. New York: Cambridge University Press.
Rubin, D. B., & Waterman, R. P. (2006). Estimating causal effects of marketing interventions using propensity score methodology. Statistical Science, 21(2), 206–222.
Thomas, L. C. (2009). Consumer credit models. Oxford University Press.
Gerald Fahner is Senior Director of Research at FICO, where he leads the Predictive Technology Group. His team develops and tests innovative approaches and algorithms for a variety of business applications, from experimental design and prediction to decision models and optimization. Gerald is also responsible for the core predictive algorithms underlying FICO’s scorecard technology. Prior to joining FICO in 1996, he served as a researcher in machine learning and robotics. Gerald received his physics diploma from the University of Karlsruhe and earned his computer science doctorate from the University of Bonn, both of which are in Germany. He lives in Austin, USA.