Information Systems Vol. 23, No. 7, pp. 423-437, 1998
© 1998 Published by Elsevier Science Ltd. All rights reserved. Printed in Great Britain
PII: S0306-4379(98)00021-0
Invited Project Review

DATA MINING IN FINANCE: USING COUNTERFACTUALS TO GENERATE KNOWLEDGE FROM ORGANIZATIONAL INFORMATION SYSTEMS†

VASANT DHAR

Department of Information Systems, Stern School of Business, New York University, New York, NY 10012

(Received 13 March 1998; in final revised form 30 October 1998)
Abstract - A common view about data mining is that it is an exercise in "clustering" customers, markets, products, and other objects of interest in useful ways from large amounts of data. In this paper, I demonstrate that the real value of data mining, particularly in the financial arena, lies more in revealing actions that lead to interesting distributions of outcomes, distributions that are not directly observable in the data. Simulating actions or events that did not occur is often more useful than trying to cluster the data, which represents only those states of nature that occurred and were recorded. I show how the use of counterfactuals, which are hypothetical events, coupled with certain types of machine learning methods produces models that promote human dialog and exploration that does not otherwise occur in routine organizational activity. I demonstrate with real-world examples how database systems, counterfactuals, and machine learning methods combine to provide a powerful bottom-up theory building mechanism that enables organizations to use databases to learn about things that are useful to them. © 1998 Published by Elsevier Science Ltd.

Key Words: Knowledge Discovery, Machine Learning, Counterfactuals, Organizational Learning
1. INTRODUCTION
Effective learning requires regular cycles of human deliberation and dialog coupled with feedback about outcomes from earlier actions. But this does not happen often enough in organizations, for several reasons. First, some actions just don't happen often enough, so the data on outcomes are sparse. Secondly, history is recorded selectively, only for states of nature that actually occurred. Thirdly, there is a cost and time associated with experimentation or exploring new possible actions; tangible benefits must arise quickly and be identifiable with previous actions in order to make experimentation and exploration worthwhile.

Modern information systems are for the most part query-driven, where databases serve as a repository of history and for efficient general purpose query processing. While query-driven systems are good for reporting purposes where information requirements are well defined, they are not appropriate for exploration, where one is trying to develop a better understanding of the problem domain.

Organizations learn through a variety of mechanisms, such as "doing" (i.e. actions) and "interpretation" of data or events [13]. However, such mechanisms have a systematic bias against some important types of learning [12]. To understand why, consider Figure 1 (from [21]), which shows the relationship between the decision to accept or reject alternatives and the subsequent outcome of the decision. The horizontal axis is partitioned into two areas, one where the decision input falls below a threshold level and the alternative is therefore rejected, and one where it is above the threshold and therefore accepted. The vertical axis shows success and failure, classifying outcomes according to whether they fall above or below a goal or aspiration level. A false positive, or Type I error, is a situation where an alternative that is accepted turns out to be a bad choice. A false negative, or Type II error, is one where a potentially good alternative is rejected or not considered.
† Recommended by Matthias Jarke
As several organizational theorists have pointed out, organizations are more likely to learn by rejecting bad alternatives than by discovering good ones, because they are able to observe the outcomes of the bad decisions, while no such information is available about decisions for which no outcomes are observable [15]. This leads to a self-correcting bias towards Type I errors, where bad alternatives get corrected over time because they are recognized as bad. On the other hand, Type II errors, the rejection or lack of consideration of good alternatives, are not addressed because no information about their outcomes is available. This limited exploration of potentially good choices limits the ability of an organization to learn.
[Fig. 1: The Relationship Between Actions and Possible Outcomes. The horizontal axis separates rejected from accepted alternatives; the vertical axis separates failure from success. The four quadrants correspond to true and false positives and negatives.]
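To make the figure concrete, here is a minimal illustrative sketch (not from the paper) that draws decision inputs and outcomes from a bivariate normal distribution with a chosen correlation, applies an acceptance threshold and an aspiration level, and counts the four quadrants; the variable names, thresholds, and distributional choice are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)

def quadrant_counts(correlation, n=100_000, accept_threshold=0.0, aspiration=0.0):
    # Draw (decision input, outcome) pairs from a bivariate normal with the given correlation.
    cov = [[1.0, correlation], [correlation, 1.0]]
    decision_input, outcome = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    accepted = decision_input > accept_threshold
    success = outcome > aspiration
    return {
        "true positives": int(np.sum(accepted & success)),
        "false positives (Type I)": int(np.sum(accepted & ~success)),
        "false negatives (Type II)": int(np.sum(~accepted & success)),
        "true negatives": int(np.sum(~accepted & ~success)),
    }

# A correlation near 1 squeezes the ellipse toward a line (few errors);
# a correlation near 0 turns it into a circle (errors as likely as correct calls).
for rho in (0.9, 0.5, 0.0):
    print(rho, quadrant_counts(rho))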
Figure 1 also demonstrates, somewhat indirectly, another impact of the paucity of history. The ellipse represents the correlation between decisions and outcomes: a correlation of 1 results in a straight line (no errors), while a correlation of 0 results in a circle. Sparse history, reflected by the absence of information relating actions to outcomes, provides decision makers with little confidence about the shape of the ellipse and, consequently, the likelihood of encountering the two types of errors. If we generalize Figure 1 by making the horizontal axis multidimensional (or a vector of inputs), the ellipse becomes a multidimensional ellipsoid. Further, if the acceptance or rejection decision is based on values between two boundaries instead of simple cutoffs, the various outcomes are not located in contiguous areas of the ellipsoid but are discontinuous across the multidimensional space. Locating these areas is a difficult problem, and scarce history makes it even harder to estimate with any degree of confidence the distribution of outcomes corresponding to various input conditions. The problem becomes harder as the number of inputs increases.

This paper demonstrates how counterfactuals coupled with certain types of machine learning algorithms provide a practical method for generating interesting conditional distributions of outcomes from large database systems. These distributions can be used to estimate correlations between actions and outcomes more accurately, that is, to provide insight into the shape of the ellipsoid of Figure 1. They are also useful in addressing the problem of Type II errors. Counterfactuals have been used extensively in logic as a means for reasoning about causation [24] and hypotheticals [20]. They are hypothetical events that never actually occurred, but if they had, would have resulted in a particular set of outcomes. Counterfactuals provide an answer to the problem of Type II errors as long as we can evaluate their
consequences with a reasonable degree of precision and the appropriate machinery exists for generating and evaluating them. In other words, we need a focused generator of hypotheses, and a precisely definable evaluation function. A precisely specified evaluation function provides the link between potential actions and outcomes. The database is used to guide the search for the more interesting actions, those that lead to good outcomes.

I first present two real-world examples from the financial industry to highlight the issues involved in learning from data. Both examples are from successful applications that are in routine use in a large financial organization. The first problem involves learning about customer relationships from transaction data, namely, which customers are "better" than others in a financial sense. A transaction is a trade that specifies that a specific volume of product was bought or sold by a customer at a particular time through a particular salesperson. When hundreds of thousands of such transactions are conducted daily, they provide a rich source of data that links customers, products, and the sales process. The organizational objective is to learn how to deal better with the various types of existing customers.

The second problem involves learning about the dynamics of financial equity markets, specifically, how prices of equities are affected by various types of data that flow into the market on an ongoing basis. The data consist of items such as earnings announcements (or surprises), analyst projections or revisions about earnings relating to companies or industry sectors, ongoing balance sheet and income statement disclosures by companies, price and volume momentum, news reports, and so on. There is widespread belief in the financial industry that such data impact the market performance of securities. But most simple relationships turn out to be inaccurate, and the complex ones are often too difficult to comprehend and reconcile with intuition. This makes the problem highly challenging in that the discovered relationships must be as simple as possible in order to enable decision makers to build a plausible domain theory, and at the same time achieve high enough levels of accuracy in prediction.

The contributions of this paper are fourfold. First, the paper presents an "existence proof" for the application of knowledge discovery methods to large scale industrial problems. Since the applications have been in use for almost three years, it is reasonable to point to them at least anecdotally as successes of data mining in finance. Second, I show explicitly how counterfactuals can be used to overcome a major learning impediment in organizations, specifically, that of not dealing adequately with Type II errors. The important lesson here is that the value of the historical data is not just in the data itself, but in using it to generate conditional distributions of outcomes which are useful in helping decision makers build a theory about the problem domain. Thirdly, I provide a framework for deciding when the results of a data mining effort, the patterns, are usable for decision support, and when they must replace human decision making. The framework reveals that the extent to which decision making can be automated depends on more than the inherent "degree of structure" of a problem as proposed by Simon [22, 23]. Finally, the paper demonstrates how features of the application domain focus the exploratory exercise of data mining.
Contrary to some popular belief, data mining is not an undirected data dredging expedition where machine learning algorithms automatically produce interesting patterns. Rather, the exercise is probably better viewed as an iterative parameterization of an existing framework of knowledge, where the parameterization can be less like a statistical regression and more conditional in nature, in the spirit of rules. A practical benefit of rules is that they are easy to understand, and useful for reasoning about events and causality.

The remainder of the paper is organized as follows. I first introduce the two examples in order to provide the context for the remainder of the discussion. I then demonstrate how the use of counterfactuals and a specific machine learning approach can generate and evaluate interesting hypotheses. Finally, I provide a framework that shows when it makes sense to use the results from a data mining exercise for decision support purposes and when to remove the human from the decision making process.

2. THE PROBLEM: GENERATING USEFUL KNOWLEDGE FROM DATA

Example 1

In a large securities firm, two senior managers disagreed about which of their customers were more profitable. One wanted to develop closer relationships with the large customers, arguing that they provided more and better business, in this case, commissions for transactions. The other countered that the large ones were really the bullies, using their market power to extract costly concessions. Based on his observation of activity on the trading desk, he argued that the larger customers often extracted lower
commissions and their transactions were often "capital hungry", that is, they often required putting capital at risk. After a few rounds of discussion with the managers about how to conceptualize the problem, the relevant problem variables were sketched out as shown in Figure 2. This sketch provides a preliminary model of the domain, including, in a rough sense, causation. The figure shows customer attributes, such as how much trading volume they generate in terms of trades and dollars, and their dealing style, that is, whether their trades required putting capital at risk. These attributes can be inferred from the trades database by aggregating the data monthly, observing the distributions of variables such as trade size, and classifying the customer accordingly. The other attributes, such as account type, fund size, and whether they have a centralized dealing operation, are demographic, available from the master accounts database.

Each of the customer attributes has a "market impact", specified in terms of the market power of the customer, the size of their trades, and whether their trades tend to contribute to or drain liquidity from the market. For example, a seller in an environment where there are mostly buyers would be providing liquidity, whereas a buyer would be draining liquidity. Providing liquidity has less market impact than draining liquidity and so, other things being equal, is more profitable. The market impact variables have a direct financial impact: how much revenue they provide and the costs they induce. The most easily measurable of these variables is commission revenue since it is recorded with each trade. Other variables include the risk associated with executing customer trades, specifically, the risk of carrying a position that would not otherwise be taken, and/or incurring excessive execution costs. This can be computed using trade and market data at the time the trade was conducted and subsequent to it. Finally, there is a cost of dealing with the customer, which includes a portion of fixed costs, and variable costs associated with servicing the customer. The cost of servicing the customer can be computed using cost accounting data, but this is not a precisely measurable variable because of the difficulty of allocating costs to customers. The dependent variable is customer profitability, which depends on commission revenue, risk, and cost of servicing. We do not know the nature of the relationship.
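As a rough sketch of the kind of monthly aggregation described above, the following assumes a hypothetical trades table; the column names and the classification threshold are illustrative, not the firm's actual schema.

import pandas as pd

# Hypothetical trade records: in practice these come from the trades database.
trades = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2", "C2"],
    "trade_date": pd.to_datetime(["1998-01-05", "1998-01-20", "1998-01-07", "1998-02-11", "1998-02-12"]),
    "trade_size": [10_000, 250_000, 5_000, 7_500, 900_000],
    "commission": [120.0, 1800.0, 80.0, 95.0, 4200.0],
    "capital_at_risk": [0.0, 50_000.0, 0.0, 0.0, 300_000.0],
})

monthly = (trades
           .assign(month=trades["trade_date"].dt.to_period("M"))
           .groupby(["customer_id", "month"])
           .agg(trade_count=("trade_size", "size"),
                median_trade_size=("trade_size", "median"),
                commission_revenue=("commission", "sum"),
                capital_hungry_share=("capital_at_risk", lambda x: (x > 0).mean())))

# Classify dealing style from the distribution of monthly aggregates,
# e.g. "capital hungry" when most trades in the month put capital at risk.
monthly["dealing_style"] = monthly["capital_hungry_share"].gt(0.5).map(
    {True: "capital hungry", False: "agency"})
print(monthly)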
[Fig. 2: A Hypothesized Model of Customer Profitability. Customer attributes (account type, fund size, trading volume, dealing style) have a market impact, which translates into financial impact (commission revenue, margins/our risk (trade profitability), and cost of servicing: sales trader, analyst, and salesman effort), and ultimately into the performance measure, customer profitability.]

Why would managers have differing opinions about the profitability of customers?
It should be apparent from Figure 2 that customer profitability is not simple to determine unless we pull together the relevant data from the databases and analyze it carefully. In fact, the considerable disagreement among senior managers about what types of customers were profitable stemmed, in effect, from the different weights being implicitly assumed by them for each of the arrows in the figure. For example, the first manager discounted the effect of the customer's market power on profitability. The second acknowledged the higher commissions, but stressed the costs and risks associated with doing business with the large customers.

While the identification of the more profitable customers is easy once the relevant data are put together, determining what types of customers are more profitable is a lot harder. With a database having 20 attributes, each with a domain of 10 values, the number of possible combinations is 10^20, a very large number for this relatively small problem. We therefore need a focused generator of hypotheses so that the relevant relationships are learned quickly.

One way to think about the learning is in terms of finding the parameters of the model sketched out in Figure 2. Indeed, a linear model could be used for this purpose. However, this often turns out to be an inappropriate way to formulate the problem because the relationships among variables often tend to be conditional, like "the cost of servicing an institutional account is high only when the trading volume is low", or "the cost of servicing rises only when accounts are below a certain size". Simple numeric weighting methods also tend to obscure rather than clarify the problem, especially when the relationship between two variables is not uniform across the range of values for the variables, or is mediated by other variables. A potentially more fruitful way of thinking about the problem is to try to estimate the more interesting conditional distributions of outcomes. The reason for restricting ourselves to the more interesting distributions is that not all attribute combinations are worth exploring, nor is it feasible to construct a complete multivariate distribution of outcomes, especially when the problem space has many inputs. In such cases, many parts of the problem space (the Cartesian product of the inputs, namely all attribute-value combinations) may be meaningless and/or not represented in the database. For this reason, it is more useful to think of the learning exercise as producing conditional distributions that are statistically supported by the data, where the conditional is basically a Boolean expression. Such conditionals can be represented in a variety of ways, such as clauses or logical propositions, linear algebraic constraints, or rules [10]. These rules can be critiqued by experts, thereby enabling us to build a domain theory from the data.

Example 2

Securities firms are interested in knowing how prices of equities are affected by various types of data that flow into the market on an ongoing basis. The market data consist of earnings announcements (or surprises), analyst projections or revisions about earnings relating to companies or industry sectors, quarterly balance sheet or income statement disclosures by companies (the fundamentals), technical indicators such as "relative strength" which represents price momentum, industry reports, and macroeconomic factors.
The challenge is to learn about the relationships among these variables and incorporate these insights into profitable trading and risk management. The ability to trade intelligently is one of the most important problems facing securities firms. As in the customer profitability example, the first order of business is to identify the right types of variables and define the types of hypotheses that are likely to be useful. Figure 3 shows a rough partial model characterizing this problem. As in the previous problem, the challenge is to find the interesting conditional distributions of outcomes, in other words, rules that express robust relationships among the problem variables. For example, a discovered rule for this problem could be something like the following:
"Positive/negative earnings surprise is associated with positive/negative future returns."

A refinement or specialization of this rule would be to find regions of positive and negative earnings surprise where the correlation with future returns is highest. For example, a refinement might be that positive or negative earnings surprises exceeding 2 standard deviations from the consensus amount have a significant impact on future returns, or that strong fundamentals coupled with positive/negative earnings surprise are associated with positive/negative future returns. Again, rules are a good way of expressing this type of knowledge and of understanding the problem domain.
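The following is a minimal sketch of how such a refinement might be searched for, assuming a table of standardized earnings surprises and subsequent returns; the column names and the synthetic data are assumptions for illustration only.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for the real data: surprise in standard deviations from consensus,
# with a return effect that exists only in the tails, plus noise.
surprise = rng.normal(size=5000)
future_return = np.where(np.abs(surprise) > 2, 0.01 * np.sign(surprise), 0.0) \
    + rng.normal(scale=0.02, size=5000)
data = pd.DataFrame({"surprise": surprise, "future_return": future_return})

# Refine the base rule "surprise is associated with future returns" by scanning
# candidate thresholds and measuring the conditional mean return in each tail region.
for threshold in (0.5, 1.0, 1.5, 2.0, 2.5):
    tail = data[np.abs(data["surprise"]) > threshold]
    signed_return = np.sign(tail["surprise"]) * tail["future_return"]
    print(f"|surprise| > {threshold}: n={len(tail)}, mean signed return={signed_return.mean():.4f}")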
[Fig. 3: A Partial Model of Equity Performance. Financial variables (earnings revisions, earnings surprise, fundamentals, investor accumulation, relative strength, industry reports, macroeconomic indicators) have a financial impact on the performance measure, future returns.]
3. COUNTERFACTUALS AND MACHINE LEARNING

Table 1 shows two types of data, where each row consists of two types of attributes, {A1, A2, ..., An} and {O1, O2, ..., Om}. Attributes Ai for i = 1 to n-1 are used to describe the states of nature, and An describes potential actions in those states. Using the subscript j to denote rows, Aij represents the value of attribute i in record j. The tuples {A1j, A2j, ..., An-1,j} represent states of nature that are recorded in the database. An action can be taken in the state of nature represented by a database record.‡ For instance, using the second example above, actions to buy a security might be taken whenever the value of the Earnings Revision attribute exceeds a positive number, say 80. The vector {O1j, O2j, ..., Omj} represents the outcome of the action taken in the corresponding state of nature. For example, the first record states that if the action is to buy the security "IBM" when its Earnings Revision score is 85, the total return over the next 30 days is 5% (O1), the volatility of the returns is 22 (O2), and so on. The output vector is like a dependent variable, except that it is multidimensional. It is worth noting that the outcomes are not usually part of a database, but can usually be precomputed and appended to the database.
Table 1: A Table with States of Nature and Outcomes. The columns group into states of nature (A1, ..., An-1), actions (An), and outcomes (O1, ..., Om).
Note that the action to buy or sell is hypothetical, since it didn't actually occur. The counterfactual consists of the conjunction of the action with its corresponding state of nature. More specifically, it is a Boolean expression over {Ai} such as "Earnings Revision > 80 AND Action = buy". The outcome vector is simply the evaluation assuming that the counterfactual is true.
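A minimal sketch of this evaluation, assuming the records of Table 1 are held in an in-memory table with illustrative column names, might look as follows.

import pandas as pd

# Illustrative rows in the shape of Table 1: state-of-nature attributes plus
# precomputed outcome attributes for a hypothetical "buy" in that state.
records = pd.DataFrame({
    "security": ["IBM", "IBM", "XYZ", "XYZ"],
    "earnings_revision": [85, 60, 92, 40],
    "return_30d": [0.05, -0.01, 0.08, 0.00],
    "return_volatility": [22.0, 18.0, 30.0, 15.0],
})

def evaluate_counterfactual(table, condition, outcome_columns):
    """Evaluate a counterfactual: a Boolean condition over the state attributes,
    conjoined with a hypothetical action. Returns the conditional outcome distribution."""
    matched = table[condition(table)]
    return matched[outcome_columns].describe()  # means, standard deviations, etc.

# Counterfactual: "Earnings Revision > 80 AND Action = buy"
summary = evaluate_counterfactual(
    records,
    condition=lambda t: t["earnings_revision"] > 80,
    outcome_columns=["return_30d", "return_volatility"],
)
print(summary)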
‡ The actions "buy" and "sell" are peculiar to this example. In a credit approval example, they may be "allow" or "refuse". In direct marketing, they may be "send" or "don't send". They may also be "null", in which case there is a direct relationship between the states of nature and the outcomes corresponding to them. Customer segmentation problems could be like this, where the states of nature might be customer demographics and/or behavior, and the outcomes are revenue, cost of servicing, and so on.
It is important to recognize that evaluating a counterfactual is a computationally expensive operation. To appreciate why, consider a database consisting of a million records, where the counterfactual "Earnings Revision > 80 AND Action = buy" is evaluated. The evaluation requires performing tests on the relevant attributes in the counterfactual a million times, and computing the various moments (averages, standard deviations, etc.) for each of the outcome attributes. In the example above, this requires computing the distributions of future returns, volatilities of returns, and so on, for one counterfactual. This is an expensive operation even if the entire database is in memory.

Counterfactuals, when generated by machine learning algorithms, are useful in helping find the more interesting rule-like relationships among subsets of the database. More specifically, they help us focus on generating the more interesting conditional distributions. Implicit in this line of attack is that the complete multivariate distribution of outcomes is neither possible to generate because of prohibitive computational costs, nor is it worthwhile trying to do so, since large parts of the complete problem space are likely to be uninteresting.

How should counterfactuals be ranked? There are a number of information-theoretic and statistical tests that can be used to compute the interestingness of a counterfactual, which we describe below. Assuming that we are looking for a rule-like structure, it makes sense to use representations and algorithms that deal with such structures. A commonly used structure is to have each element of a rule be a Boolean expression defined over the problem variables. If the rules are conjunctions of such expressions, as in the example above, the problem is said to be in disjunctive normal form [17]. Two natural candidates for generating counterfactuals are rule induction algorithms and genetic algorithms. Rule induction algorithms are commonly used in the machine learning community. Genetic algorithms have been less popular, primarily because they are slow, even though they are much more thorough in their search [3, 14, 18].

Figure 4 shows an example where the data from the customer profitability example is partitioned successively into clusters, where each cluster contains a certain "slice" of the data. Since the partitioning is successive, each cluster is "purer" than its ancestor in terms of the similarity of the dependent variable values of the cases in it. We have assumed for simplicity that the dependent variable and the fund size are categorical (high and low), and that the account age is continuous. The leftmost cluster in Figure 4 shows the complete data set, containing low revenue consumers as crosses and high revenue ones as circles respectively. Their proportion in the database as a whole is 50:50. The first split, on FundSize, produces a slightly higher proportion of "profitable" consumers. But why did we split on this variable? In this example, it turns out that the reason is simple: this attribute produces the best improvement, where improvement is defined in an information-theoretic sense: reducing the entropy of the original cluster. (Note that any proportion would have lower entropy than a uniform distribution such as 50:50.) The small fund size group is further partitioned on AccountAge, under and over 5, yielding a cluster where 62.5% of the cases are high consumers.
The "rule" or pattern that emerges is that smaller accounts that are less than 5 years old belong to the profitable category, or more formally:

IF FundSize.X = "Small" AND AccountAge.X < 5 THEN Profit.X = "Large"    (Rule 1)
where X refers to a particular exemplar. The parts of the rule before and after the "THEN" are referred to as the antecedent and the consequent of the rule respectively. Assuming that the rule above is a robust one, a problem we address more fully in the next section, it appears on the surface to validate the second manager's hypothesis, that smaller accounts are more profitable. However, what does the age of the account have to do with profitability? The result was puzzling initially, but several alternative hypotheses were offered. Some were based on the demographics of the newer accounts. Others were based on the composition of the sales force having changed in the last five years, so that the nature of the relationships might have changed recently. Some of these hypotheses were verifiable from the data, while others were not. The important point that the result illustrates is that the data mining exercise leads to a deliberation about missed opportunities and potential possibilities that would not otherwise occur in routine organizational activity. The data mining exercise forces an exploration of possibilities and a questioning of prior beliefs, leading to a better understanding of the problem domain.
Fig. 4: An Example of how a Tree Induction Algorithm Partitions Data into Homogeneous Classes (Cross = Class 1 = "Low" Profit Consumers, Circle = Class 2 = "High" Profit Consumers)
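To make the split-selection criterion concrete, here is a minimal sketch of entropy-based information gain over hypothetical customer records; a production tree induction algorithm such as CART or C4.5 [1, 19] adds normalization, pruning, and handling of continuous split points on top of this idea.

import math
from collections import Counter

def entropy(labels):
    # H = -sum over classes of p * log2(p), for the classes present in the cluster.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(rows, labels, predicate):
    # Information gain of splitting a cluster by a Boolean predicate,
    # weighting each child's entropy by its share of the cases.
    parent = entropy(labels)
    left = [lab for r, lab in zip(rows, labels) if predicate(r)]
    right = [lab for r, lab in zip(rows, labels) if not predicate(r)]
    weighted = sum(len(part) / len(labels) * entropy(part) for part in (left, right) if part)
    return parent - weighted

# Hypothetical customer records: (fund size, account age in years) with profit class labels.
rows = [("Small", 2), ("Small", 3), ("Small", 8), ("Large", 1), ("Large", 6), ("Large", 9)]
labels = ["High", "High", "Low", "Low", "Low", "High"]

print("gain(FundSize = Small):", split_gain(rows, labels, lambda r: r[0] == "Small"))
print("gain(AccountAge < 5):  ", split_gain(rows, labels, lambda r: r[1] < 5))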
The same lesson applied to the market prediction problem. To illustrate how, consider the following. The hypothesized relationship mentioned earlier, "higher surprise leads to higher returns", was more accurate in the tails of the distribution of earnings surprise. That is, it holds only when the surprise is more than, say, one standard deviation from the consensus mean earnings estimate. Otherwise, the effect is non-existent. What this says, in effect, is that most of the cases, those within one standard deviation, are essentially "noise", whereas the "signal" is located in selected areas, namely, the tails. Figure 5 shows this distribution of earnings surprise, with the areas marked as "action" being the areas where the effect occurs.
[Fig. 5: The "Action" Areas Signify that the Relationship Holds in the Tails Only. The distribution of the degree of earnings surprise, annotated: "Earnings surprises more than +X standard deviations from the consensus mean lead to positive returns over the next N days."]
We can as easily construct examples where the effect, the interesting conditional distribution of the dependent variable, might occur everywhere except towards the tails, and so on. The point is that the "signal" is located only in certain selected areas, which makes the relationship between the independent and dependent variables nonlinear. The challenge is to find these nonlinear relationships. The effect on the dependent variable is often due to the interaction of two or more independent variables. Indeed, Holland [9] characterizes complex systems as ones where the interaction effects become more important than the effects of each variable taken separately. For example, the earnings surprise effect holds more strongly when the fundamentals of the company, measured in terms of variables such as price/earnings ratios, are strong. Such interaction effects are basically nonlinear. It is easy to see how a tree induction algorithm would generate a decision tree like that of Figure 4 with such data.
Rule induction algorithms generate counterfactuals sequentially. Another type of learning algorithm that can generate counterfactuals in parallel is a genetic algorithm, which is motivated by theories of biological natural selection [2, 9]. Figure 6 shows how a counterfactual can be represented for a genetic algorithm, in terms of a "chromosome" consisting of genes. In this case, the counterfactual or hypothetical action is "buy when the PE Ratio is between 1 and 5 AND Earnings Surprise is greater than 2 (standard deviations)". The pairs of genes in Figure 6 represent Boolean expressions such as "PE Ratio between 1 and 5". The chromosome as a whole in this case represents the conjunction of these Boolean expressions. The asterisks represent "don't care" values for the other variables. The fitness score of this counterfactual is 78. It is possible to associate various types of semantics with a chromosome structure, as long as the evaluation machinery is able to interpret the chromosome. A genetic algorithm works on a population of chromosomes, each representing a hypothesis that is evaluated against the database. As Figure 6 shows, a chromosome represents an entire trajectory of a decision tree, from the root to a leaf node. The genetic algorithm in effect generates a counterfactual, a chromosome, in one shot. It can evaluate an entire population of chromosomes simultaneously because evaluation is completely parallelizable. The fitness metric is used to rank order the chromosomes in the population.
[Fig. 6: A Chromosome Representing a Boolean Expression. Gene pairs encode interval constraints on PE Ratio, Dividend Yield, Earnings Surprise, and Relative Strength, with asterisks for "don't care" values; the final field holds the fitness score.]
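A minimal sketch of such a chromosome and a toy generation loop is shown below; the gene encoding, the fitness function, and the synthetic records are assumptions for illustration, whereas a real implementation would evaluate fitness against the historical database.

import random

random.seed(0)

VARIABLES = ["pe_ratio", "dividend_yield", "earnings_surprise", "relative_strength"]

def random_gene():
    # Each gene is either a "don't care" (None, i.e. the asterisk) or a (low, high) interval.
    if random.random() < 0.5:
        return None
    low = round(random.uniform(0, 5), 1)
    return (low, round(low + random.uniform(0.5, 5), 1))

def random_chromosome():
    # A chromosome is a conjunction of interval constraints, one slot per variable.
    return [random_gene() for _ in VARIABLES]

def matches(chromosome, record):
    return all(gene is None or gene[0] <= record[var] <= gene[1]
               for var, gene in zip(VARIABLES, chromosome))

def fitness(chromosome, records):
    # Hedged stand-in for the real evaluation: mean future return over matched records,
    # set to zero when too few records support the rule.
    matched = [r["future_return"] for r in records if matches(chromosome, r)]
    if len(matched) < 5:
        return 0.0
    return sum(matched) / len(matched)

def crossover(a, b):
    point = random.randrange(1, len(VARIABLES))
    return a[:point] + b[point:]

def mutate(chromosome, rate=0.2):
    return [random_gene() if random.random() < rate else g for g in chromosome]

# Synthetic records standing in for the historical database.
records = [{**{v: random.uniform(0, 10) for v in VARIABLES},
            "future_return": random.gauss(0, 0.05)} for _ in range(500)]

population = [random_chromosome() for _ in range(30)]
for generation in range(20):
    ranked = sorted(population, key=lambda c: fitness(c, records), reverse=True)
    parents = ranked[:10]
    population = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                            for _ in range(20)]

best = max(population, key=lambda c: fitness(c, records))
print("best chromosome:", best, "fitness:", round(fitness(best, records), 4))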
Ranking the chromosomes provides the genetic algorithm with an effective selection mechanism. The genetic algorithm works by creating populations of competing hypotheses, ranking them, and exchanging information among the promising competitors with the purpose of generating better hypotheses. Standard operators of crossover and mutation are used for exchanging information between chromosomes and exploring new combinations of expressions defined over the independent variables [7]. One of the more powerful features of this algorithm is its ability to learn from the results of prior hypotheses, that is, to refine its hypotheses as it goes along. The algorithm is able to exploit its flukes to dramatically improve search performance. The genetic algorithm also has a natural parallel flavor to it, which makes it well suited for dealing with combinatorial problems. When search is combinatorial (and we cannot use numerical methods or gradient descent), the genetic algorithm is a highly effective method.

The evaluation function of tree induction and genetic algorithms can be simple statistical or information theoretic tests, or more complex programs. In the example of Figure 4, each split is on a single variable, based on entropy reduction. The entropy of a cluster i can be computed using the standard formula:

H_i = - Σ_c p_c log2(p_c)

where p_c is the probability of an example in the cluster belonging to the c-th class. When applied to a data set i, the entropy H_i measures the average amount (number of bits) of information required to identify the class of an exemplar in the data set. The entropy of a cluster is at a minimum when one class has probability 1, that is, all members belong to the same class, and at a maximum when an exemplar is equally likely to belong to any class, as in the leftmost cluster of Figure 4. The gain from a split is computed based on the difference between the entropy of the parent cluster and the total entropy of the child clusters resulting from the split. That is, if a cluster i is partitioned into subsets j:

gain_i = H_i - Σ_j H_j R_j
where R_j is the ratio of the number of cases in cluster j to those in i. This is an information theoretic measure of the value of information obtained from the split. It is the amount of discrimination that the split produces on the distribution of the dependent variable. In computing the goodness of a split, this gain value needs to be normalized so that fewer splits and larger clusters are favored over smaller ones. Otherwise the algorithm would be biased towards producing very small clusters (in the extreme case, of size 1, since these would minimize entropy), which typically result from overfitting the data and are therefore useless for predictive purposes. There are a number of heuristics for implementing this normalization [see, for example, 19, 1, 16].

Is the production of counterfactuals simply the execution of a clustering algorithm? Actually, it is more general. Any machinery that can generate rule-like propositions and evaluate them is usable. In the examples we considered, the outcomes were precomputed and specified as part of the data record for simplicity. But the evaluation can be dynamic, requiring the execution of a program, such as a Monte Carlo simulation, or training a neural network and using the output of the trained net as the fitness score for the counterfactual. Even if precomputation of the outcome vector is possible for each recorded state of nature, evaluating the counterfactual is still expensive. Recall that in the equity prediction example, the performance depends, among other things, on the returns and the volatility of returns associated with "executing" a counterfactual or a hypothetical trading strategy. Its evaluation results in output that is not part of an existing database. In other words, "applying" the relationship being hypothesized is a more general and computationally expensive activity than data clustering. It involves envisioning the consequences of an action.

4. THE EVALUATION OF COUNTERFACTUALS

What makes counterfactuals powerful is that instead of asking "What data match the pattern expressed by this query?", we turn the process of discovery around, to answer the question "What patterns match this data?" By turning the question this way, we effectively automate a creative aspect of theory formation, namely the generation of hypotheses, their evaluation, and refinement. However, there is a well recognized problem with this process of data mining: if you try long and hard enough, some model is likely to perform well when applied to the data! So, how do we know whether the discovered model is simply "fitted" to the data, or whether it expresses some robust relationship among the problem variables that is likely to hold in the future?

Suppose, for example, that we obtained the results shown in the top part of Figure 7 by applying the earnings surprise rule. The data to the left of the solid vertical line in the upper graph show the performance of a model in terms of monthly returns on the data on which it was trained and tested. The data to the right of the bar show its performance on new data. The lower graph shows the performance of the model on a completely new universe of instruments across the entire time period. In a sense, this is a stronger test than the previous ones because not only are we expecting the relationship to hold across a demographically different universe, but also across different time periods. How do we evaluate the results from applying the rule? Does the top part of Figure 7 represent a "good" outcome?
There are a number of standard statistical tests for such problems. One common method is to compare the performance to some benchmark, like the S&P 500 index. Another is to check for the consistency of the rule across time or in a different universe. In the example of Figure 7, for instance, if we examine the monthly performance of the model across the two asset classes, we see the distributions of performance in Figure 8. The bars labeled "A" correspond to the top part of Figure 7, whereas those labeled "B" are from the bottom part. A standard statistical test, such as an F-test or a Chi-squared test, can be performed in order to determine the degree of similarity of the underlying distributions. This would tell us whether the performance of the model on the two asset classes across time is similar. In addition, since there is a time component, we might also be interested in examining the cross-correlation of the performance on the two different data sets, depending on the assets being used. If the model has discovered a stable underlying pattern in the data, then the performance should be consistent across similar market conditions.
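A minimal sketch of such a comparison, assuming two hypothetical arrays of monthly returns for the original asset class (A) and the new one (B), might use scipy as follows.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical monthly returns of the model on the original universe (A) and the new one (B).
returns_a = rng.normal(loc=0.010, scale=0.04, size=60)
returns_b = rng.normal(loc=0.012, scale=0.05, size=60)

# F-test on the ratio of variances: is the spread of monthly performance similar?
f_stat = np.var(returns_a, ddof=1) / np.var(returns_b, ddof=1)
p_f = 2 * min(stats.f.cdf(f_stat, len(returns_a) - 1, len(returns_b) - 1),
              stats.f.sf(f_stat, len(returns_a) - 1, len(returns_b) - 1))

# Chi-squared test on binned returns: are the shapes of the two distributions similar?
bins = np.histogram_bin_edges(np.concatenate([returns_a, returns_b]), bins=6)
counts = np.array([np.histogram(returns_a, bins)[0], np.histogram(returns_b, bins)[0]])
chi2, p_chi2, _, _ = stats.chi2_contingency(counts + 1)  # +1 smoothing avoids empty bins

# Cross-correlation of the two monthly series: do they move together over time?
cross_corr = np.corrcoef(returns_a, returns_b)[0, 1]

print(f"F-test p={p_f:.3f}, chi-squared p={p_chi2:.3f}, cross-correlation={cross_corr:.3f}")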
[Fig. 7: Performance of a Model Across Two Asset Classes and Across Time. Vertical axis: monthly returns.]
[Fig. 8: Performance Distributions of the Discovered Model on the Original Asset Class (A) and the Out-of-Sample Asset Class (B).]
In the above example, the evaluation was performed across time as well as across a different universe. What is interesting is that the evaluation machinery can be coupled tightly to the counterfactual generation machinery, so that most of the work of validating the pattern thoroughly is done before it is presented to decision makers. It should also be apparent that the evaluation machinery can be simple, such as entropy reduction, or complex, like a Monte Carlo simulation. Indeed, the returns in the chart in Figure 7 were based on a simulation of the rule on historical data.
Whatever the evaluation method, it is important to identify patterns that are not simply statistical coincidences. For the prediction problem, it is important to simulate a hypothesis over time and, depending on the generality we expect, on other universes. In the customer profitability example, evaluation across time may not make much sense, while some other basis, such as geography, might be a more meaningful way to partition the universe. Figure 9 provides a simple framework that illustrates the different types of validation possibilities [5].
[Fig. 9: Testing Strategies are Based on Whether They Account for Variance Across Time and Across the Data Universe. Dark Circles Represent Training Data and White Circles Represent Testing Data. Gray Circles Represent Data that May or May Not be Used for Testing. The four quadrants are: not across time, not across universe (out of sample); not across time, across universe (out of sample, out of universe); across time, not across universe (out of sample, out of time); across time, across universe (out of sample, out of time, out of universe).]
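A minimal sketch of the four testing strategies of Figure 9, assuming records tagged with a (hypothetical) date and universe label, might look as follows.

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical records: each observation carries a date and a universe label (e.g. asset class).
data = pd.DataFrame({
    "date": pd.to_datetime([f"{1990 + m // 12}-{m % 12 + 1:02d}-01" for m in range(96)]).repeat(2),
    "universe": ["A", "B"] * 96,
    "x": rng.normal(size=192),
    "y": rng.normal(size=192),
})

def split(data, across_time=False, across_universe=False,
          cutoff="1995-12-31", train_universe="A"):
    """Return (train, test) for the chosen Figure 9 quadrant:
    neither flag set: random out-of-sample split; across_time: train before / test after cutoff;
    across_universe: train on one universe, test on the other; both: combine the two."""
    if not across_time and not across_universe:
        train = data.sample(frac=0.7, random_state=0)
        return train, data.drop(train.index)
    train_mask = pd.Series(True, index=data.index)
    test_mask = pd.Series(True, index=data.index)
    if across_time:
        train_mask &= data["date"] <= cutoff
        test_mask &= data["date"] > cutoff
    if across_universe:
        train_mask &= data["universe"] == train_universe
        test_mask &= data["universe"] != train_universe
    return data[train_mask], data[test_mask]

for t in (False, True):
    for u in (False, True):
        train, test = split(data, across_time=t, across_universe=u)
        print(f"across_time={t}, across_universe={u}: train={len(train)}, test={len(test)}")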
The figure breaks up the testing data along two dimensions: time and universe. The upper left quadrant might correspond to the customer profitability example we described earlier. In this case, the testing data could be chosen completely randomly from the full data set. This implies two things: first, that the phenomenon captured by the data stays relatively stable over time, and secondly, that the data, the behavior of the customer population, stay relatively homogeneous over time. Commonly used testing techniques such as the bootstrap and cross validation are forms of this approach [6].

The lower left quadrant encourages us to partition the data universe into different segments. An example of this situation is one in which a model is trained on customers in New York, but tested on those from other major cities around the country. In this case, the data universe is heterogeneous. We might also split customers according to size, trading frequency, and so on in order to test whether a pattern holds across alternative partitionings of the data. The upper right quadrant describes testing in which data for training are chosen from any time period prior to a certain date and testing data are selected from time periods only after that date. A market prediction model might be trained using data from 1990 to 1995 and tested on data from 1996 to 1998. The lower right quadrant further partitions the testing across time and universe, as shown in Figure 7. In this case, not only are the test data from a different universe, they are also from a different time period.

In problems where time plays an important role, rules derived from data have a much higher chance of being robust if we break down the data across universe and time. When the data are plentiful, this ensures that the discovery algorithm will pursue only those hypotheses that perform consistently across the various sets of data instead of those that are very good on some subsets and poor across others. This requires thinking carefully, before the data mining exercise, about why we would expect a model to perform uniformly across the various subsets of the data. In practice, these subsets might correspond to different customer segments, different time periods, and so on.

Another way of viewing the above is that data mining is not an undirected bottom-up fishing exercise. Data mining experiments take time to set up and interpret. These experiments are not costless;
indeed, they can be quite expensive. For this reason, it is important to formulate the problem carefully, with a priori hypotheses about the types of relationships we should expect to discover and why. In other words, while the data mining exercise is "exploratory" as pointed out by Tukey [25], the exploration must be thought out as carefully as possible at the start of the exercise.

5. THE USAGE OF DISCOVERED PATTERNS

In the previous section, I stressed the need for rigorous testing of counterfactuals. An obvious question that follows is that if the discovered patterns are indeed robust, why not eliminate the decision maker from the loop altogether? It turns out that the answer to this depends on more than the extent to which the problem is ill structured [23]. Many frameworks in the DSS literature [8, 11] are based on Simon's notion of programmable versus non-programmable decisions [22]. The basic idea is that judgement is required for the non-programmable problems and models are therefore useful in supporting human judgement, not in automating it. It turns out, however, that the issue of automation versus support depends more critically on other factors.

First, it depends on the crispness of the payoff function. The various DSS frameworks assume that the more ill structured problems have poorly defined payoff functions. However, as we can see in the market prediction example, the payoff function is precise, but the problem is highly ill structured. Does it make sense to automate decision making just because the payoff function is well defined and can hence be computed by a model? The answer hinges on a second consideration, namely, the objective of the data mining exercise. If the objective is to develop a theory about the problem domain, it is imperative that the decision maker be removed from the loop. The model must be tested without human intervention, otherwise it is impossible to separate the contribution of the human from that of the machine learning based model.

This point poses an interesting paradox. If we're using machine learning methods and counterfactuals to uncover patterns in complex nonlinear situations where the output of the model is a decision, then we must be particularly careful in ensuring that it is the learned model that is separating the noise from the signal and not the decision maker. Otherwise we don't know whether the learned model captures the true structure of the underlying problem or whether human judgement is compensating for a poor model.

Consider the earnings surprise example. The top part of Figure 7 shows the returns assuming a completely systematic or automated implementation of a discovered rule. The part to the left of the heavy vertical bar shows the performance of the automated implementation based on historical data. If the model is any good, or alternatively, if it truly captures the underlying structure in the data, we would expect the performance to be good on data it has never seen before. As it turns out from observing Figure 8, the learned model performs consistently when applied to a new universe. But if the results were achieved through human intervention, we would not know whether our learned model captured a real effect in the data, or whether the decision maker steered it right when it made bad decisions, or wrong when it made good ones. The counterintuitive point above is that even highly non-programmed decision situations may require completely automated solutions in order to test the model.
In the example above, where a theory about market behavior is being tested, the precision with which the payoff function is defined and the objective of the modeling exercise (i.e. whether it is theory formation) have more of an impact on whether a decision is automated or supported than problem complexity. Even though the decision is a complex one, the payoff function is precise: the profit or loss on a trade, and the data to compute this are available. Automation provides a clean and rigorous way to test the theory.

The other point is that problems that are more structured may not be easy or practical to automate. In the example about the sales manager's conjecture about profitable sales strategies, coming up with a precise function relating actions and payoffs is difficult. For example, if a relationship is discovered between trade size and profitability, it cannot be used to literally automate the behavior of salespeople. Even if such a relationship could be turned into a precise directive, it would be virtually impossible to determine whether salespeople were actually trying to generate larger trades, nor would there be a precise relationship between their actions and the financial outcomes.

Table 2 summarizes the discussion in this section, indicating automation or support depending on the precision of the payoff function and theoretical objectives. The upper left quadrant, exemplified by the
trading example is one where automation makes sense. In this quadrant, the relevant data for theory formation are available, and the payoff function is well defined.
[Table 2: Decision Support Versus Automation, according to whether the objective is theory formation and whether the payoff function is precisely defined.]
The table also explains the relative preponderance of systems that support rather than automate managerial decision making. In most management problems, intangibles and the risks of automation make it hard to define payoff functions precisely. However, the financial domain is somewhat of an exception in this respect, since many of the problems that require analytical decision support, such as trading, risk management, and hedging, also have precisely specifiable payoff functions. For this reason, we may see more decision automation in the financial arena going forward. To some extent, this has already happened with problems such as program trading and risk arbitrage, where humans have been replaced by technology.

6. CONCLUDING REMARKS

In conclusion, a few summary remarks are in order. My first objective in this paper was to show how counterfactuals and machine learning methods can be used to guide exploration of large databases in a way that addresses some of the fundamental problems that organizations face in learning from data. The solution I have proposed makes it possible for us to estimate with some degree of confidence the conditional distributions of outcomes, that is, conditional on various counterfactuals being true or false. Generating these conditional distributions can provide a better estimate of the correlation between the more interesting actions and outcomes, which is hard to assess otherwise. The major organizational benefit of this is the mitigation of Type II errors.

The above claim is not a hypothetical one. Several managers who have participated in data mining exercises for the two problems discussed above, customer profitability and trading, have testified publicly about how the exercises induced them to explore and understand problem areas they would not have considered or been able to resolve as part of regular organizational activity. While my claims are anecdotal, there is currently considerable business activity in the areas of data warehousing and data mining that suggests that organizations are investing considerable resources in order to better leverage their data resource. The methods I have described in this paper are useful in helping us understand why this is a useful endeavor and how organizations can benefit from it.

It should also be noted that the solution I have proposed requires a crisply definable evaluation function even though the problem might be ill-structured [23]. Problems in finance often fall into this category. Even though investment decisions are complex, performance is precisely measurable. Similarly, customer profitability is measurable, although somewhat less so because of the difficulty of allocating costs precisely. Likewise, credit risk problems are also complex but measurable in terms of performance, as are several other problems in the financial industry.

Finally, it is worth making the following distinction. Data mining is often viewed as a "data clustering" or data "reduction" exercise where the aim is to generate useful abstractions of the data. While this is an appropriate engineering oriented view, it understates the true power of data mining, namely, as a bottom-up theory building process, where the data serve to enable a problem solver to extract new and interesting relationships about the problem domain. Counterfactuals provide an effective representation of such relationships, and machine learning algorithms provide a powerful problem solving mechanism for focusing the generation and evaluation of the counterfactuals.
In this latter view of data mining, the objective is not just to abstract the data, but to use the data along with a rudimentary specification of the problem variables in order to reason about events that may not have occurred, and are therefore not represented in the data. Accordingly, the problem space that is explored can be extremely large and comprehensive, resulting in the discovery of new and interesting relationships about the problem domain. In both examples described in this article, this has turned out to be the case, leading to insights that would not have occurred otherwise.
Acknowledgement - I wish to acknowledge Jim March for valuable comments on an earlier draft of the paper, and Matthias Jarke for useful insights in reframing some of the main arguments in the paper.
REFERENCES

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterey, CA (1984).
[2] R. Dawkins. The Blind Watchmaker. W.W. Norton and Co. (1987).
[3] V. Dhar and D. Chou. Methods for Achieving High Support and Confidence With Genetic Learning Algorithms. Working Paper, Department of Information Systems (1997).
[4] V. Dhar and R. Stein. Seven Methods for Transforming Corporate Data Into Business Intelligence. Prentice-Hall, Englewood Cliffs (1997).
[5] V. Dhar and R. Stein. Finding robust and usable patterns with data mining: Examples from finance. PCAI (1998).
[6] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York (1993).
[7] D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA (1989).
[8] G.A. Gorry and M.S. Scott Morton. A framework for management information systems. Sloan Management Review, 13(1) (1971).
[9] J.H. Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor (1975).
[10] J.N. Hooker. A quantitative approach to logical inference. Decision Support Systems, 4(1) (1988).
[11] P. Keen. Decision Support Systems. Addison-Wesley (1978).
[12] D. Levinthal and J.G. March. The myopia of learning. Strategic Management Journal, Volume 14 (1993).
[13] B. Levitt and J.G. March. Organizational learning. Annual Review of Sociology, Volume 14 (1988).
[14] S.W. Mahfoud. A comparison of parallel and sequential niching methods. Proceedings of the Sixth International Conference on Genetic Algorithms (1995).
[15] J.G. March. The pursuit of intelligence in organizations. In Teresa Lant and Zur Shapira, editors, Managerial and Organizational Cognition, LEA (1998).
[16] D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood Ltd. (1994).
[17] N. Nilsson. Artificial Intelligence. Tioga Press (1990).
[18] N. Packard. A Genetic Learning Algorithm. Technical Report, University of Illinois at Urbana-Champaign (1989).
[19] J. Quinlan. Machine Learning and ID3. Morgan Kaufmann, Los Altos (1992).
[20] N. Rescher. Hypothetical Reasoning. University of Pittsburgh Press (1964).
[21] Z. Shapira. Risk Taking: A Managerial Perspective. Russell Sage Foundation (1997).
[22] H.A. Simon. The New Science of Management Decision (1965).
[23] H.A. Simon. The structure of ill-structured problems. Artificial Intelligence (1973).
[24] P. Suppes. A Probabilistic Theory of Causation. North-Holland (1970).
[25] J.W. Tukey. Exploratory Data Analysis. Addison-Wesley, Reading, MA (1977).