JACOB ZAHAVI · NISSAN LEVIN
Issues and Problems in Applying Neural Computing to Target Marketing

JACOB ZAHAVI is associate professor of management at the Faculty of Management, Tel Aviv University. He holds a PhD in systems engineering from the University of Pennsylvania, an MSc in operations research from the Technion, Israel Institute of Technology, and a BA in economics and statistics from the Hebrew University. Previously, he held visiting positions at the University of Pennsylvania, University of Southern California, Cornell University, and the University of California at Los Angeles. His current research interest is in developing decision support systems in database marketing, involving operations research models, statistical methods, expert systems concepts, and computer technology.

NISSAN LEVIN is currently teaching database marketing, operations research, and statistics at the Faculty of Management, Tel Aviv University. He holds a BSc in physics and mathematics from the Hebrew University, and an MSc in operations research and a PhD in business administration from Tel Aviv University. His current research interest centers on the quantitative aspects of direct and database marketing.
ABSTRACT
Applying a neural network (NN) to the targeting and prediction problems in target marketing poses some unique problems and difficulties unparalleled in other business applications of neural computing. We discuss several of these issues in this article, as applied to solo mailings, offer remedies to some, and discuss possible solutions to others. A numerical example, using NN backpropagation models and involving realistic data, is used to exemplify some of the resulting issues.
1. INTRODUCTION
The neural network (NN) is often mentioned in theoretical and practical journals as a promising and effective alternative to conventional response modeling in database marketing (DBM) for targeting audiences through mail promotions. An NN employs the results of a previous mailing campaign, for which it is known who responded with an order and who declined, to train a network and come up with a set of "weights," each representing the strength of the connection between any two linked nodes, or neurons, in the network. These weights are then used to score the customers in the database and rank them according to their likelihood of purchase; usually, the higher the NN score, the higher the propensity to purchase. What makes NN particularly attractive for target marketing is the fact that previous examples to train the system abound in DBM, with their number increasing dramatically in recent years due to advances in computer and database technology and data-gathering techniques. The availability of NN computer software and the widespread use and increasing processing power of PCs have made NN an attractive tool in DBM, joining a growing list of applications of NN to business problems reported in the literature, for example, firm bankruptcy prediction (21), loan application approval (4), predicting the rating of corporate bonds (3), fraud prevention (11), time-series forecasting (19), and even solving optimization problems (9,18). But the application of NN to DBM is not all that straightforward, as it poses some unique problems stemming from the special characteristics of DBM which are not usually encountered in other business applications of NN. We investigate several of these issues in this article, as applied to the case of solo mailings, offer remedies to some, and discuss possible solutions to others. A numerical example employing NN backpropagation models and involving realistic data is used to demonstrate some of the resulting issues. In Section 2 we review the NN backpropagation models in brief; in Section 3 we present the decision process in DBM for solo mailings of new products; in Section 4 we discuss the application of NN to the targeting and prediction problems of solo mailings, followed in Section 5 with a discussion of the
problems involved in applying NN to DBM applications; we exemplify some of these difficulties with the help of a numerical example in Section 6, and conclude with a short summary in Section 7.
2. A REVIEW OF BACKPROPAGATION NN MODELS
An NN is a biologically inspired model which tries to mimic the performance of the network of neurons, or nerve cells, in the human brain. Expressed mathematically, an NN is made up of a collection of processing units (neurons, nodes), connected by means of branches, each characterized by a weight representing the strength of the connection between the neurons. In supervised learning models, these weights are typically determined by means of a training process, by repeatedly showing the NN examples of past cases for which the actual output is known, thereby inducing the system to adjust the strength of the weights between neurons. On the first try, since the NN is still untrained, the input neurons will send a current of initial strength to the output neurons, as determined by the initial conditions. But as more and more cases are presented, the NN will eventually learn to weigh each signal appropriately. Then, given a set of new observations, these weights can be used to predict the resulting output. Many types of NN have been devised in the literature; perhaps the most common one, which forms the basis of most business applications of neural computing, is the supervised-learning, feedforward network, also referred to as the backpropagation network. In this model, which resulted from the seminal work of Rumelhart, McClelland, and the PDP Research Group (13), the NN is represented by a weighted directed graph, with nodes representing neurons and links representing connections. A typical feedforward network contains three types of processing units: input units, output units, and hidden units, organized in a hierarchy of layers; Figure 1 demonstrates a four-layer system. The flow of information in the network is governed by the topology of the network. A unit receiving input signals from units in a previous layer aggregates those signals based on an input function I, and generates an output signal based on an output function O (sometimes called a transfer function). The output signal is then routed to other units as directed by the topology of the network. The input function I that is often used in practice is the linear one, and the transfer function O is either the hyperbolic tangent or the sigmoidal (or logit) function (14). The weight vector W is determined through a learning process to minimize the sum of squared deviations between the actual and the calculated output, where the sum is taken over all the output nodes in the network. The backpropagation algorithm consists of two phases: forward propagation and backward propagation. In forward propagation, outputs are generated for each node on the basis of the current weight vector W and propagated to the output nodes to generate the total sum of squared deviations. In backward propagation, errors are propagated back, layer by layer, adjusting the weights of the connections between the nodes to minimize the total error. The forward and backward propagations are executed iteratively, once for each pass over the training examples (called an epoch), until convergence occurs. The type and topology of the backpropagation network depend on the structure and dimension of the application problem involved, and could vary from one problem to another.

[FIGURE 1. Neural Network Topology with Two Hidden Layers: inputs enter at the bottom layer, pass through two hidden layers, and emerge at the output layer.]
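For illustration, a minimal one-hidden-layer backpropagation network of the kind described above can be sketched in Python with NumPy as follows; the layer sizes, learning rate, number of epochs, and toy data are arbitrary assumptions for exposition only, not the configuration used in the numerical example of Section 6.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy learning sample: 200 cases, 5 predictors, binary purchase outcome.
X = rng.normal(size=(200, 5))
y = (rng.random(200) < sigmoid(X @ np.array([1.0, -0.5, 0.3, 0.0, 0.2]))).astype(float)

n_hidden = 4
W1 = rng.normal(scale=0.1, size=(5, n_hidden))   # input -> hidden weights
b1 = np.zeros(n_hidden)                          # hidden-layer bias node
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))   # hidden -> output weights
b2 = np.zeros(1)                                 # output bias node
eta = 0.05                                       # learning rate

for epoch in range(500):                         # one epoch = one pass over the sample
    # Forward propagation: compute the output score of every case.
    H = sigmoid(X @ W1 + b1)                     # hidden-node outputs
    out = sigmoid(H @ W2 + b2).ravel()           # output-node score
    err = out - y                                # deviation from the actual outcome
    # Backward propagation: push the squared-error gradient back, layer by layer.
    d_out = err * out * (1.0 - out)
    d_hid = (d_out[:, None] @ W2.T) * H * (1.0 - H)
    W2 -= eta * (H.T @ d_out[:, None]) / len(y)
    b2 -= eta * d_out.mean()
    W1 -= eta * (X.T @ d_hid) / len(y)
    b1 -= eta * d_hid.mean(axis=0)

print("training sum of squared deviations:", float((err ** 2).sum()))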
3. THE DECISION PROBLEM IN SOLO MAILINGS
In solo mailings customers are offered one product at a time, in contrast to catalog mailings where customers are offered many products to choose from. While perhaps not as large as the catalog industry, solo mailing nevertheless constitutes a major segment of the DBM industry. Several types of decisions are usually involved in DBM of solo mailings (10,15):
targeting, optimization, prediction, order commitment, and mail management. The targeting issue is a classification problem whose objective is to partition the customer list into targets, who receive the mail, and nontargets, who do not. The optimization problem is concerned with finding the best offer to promote the product, where the term offer is generalized here to include any promotional variable, or a combination thereof, such as price, brochure type, positioning of the product, payment option, image, and so on. Often, the offer type might have a substantial impact on consumers' response. Sometimes, different segments of the population receive different offers. The prediction problem has to do with forecasting the number of orders expected to be generated by the mailing. It is a very critical problem in DBM, because it provides the basis for product commitment and inventory decisions, and often direct marketing companies do not launch the mailing campaign before they stock up on merchandise to fulfill customers' orders. Finally, the ultimate goal of target marketing is to determine which accounts will be mailed in a given time window, which products, and how many times. This is the mail management problem, whose purpose is to find the number and mix of products to promote to any given customer to minimize cannibalization and maximize profits. The two core issues in the decision process are the targeting problem (whom to mail to) and the prediction problem (how many orders to expect from the mailing). These two issues provide the
basis for all the other decision problems mentioned previously. Typically, the decision process in DBM is based on the results of a previous mailing to assess customers' response. In case a new product is involved, where no information whatsoever is available on customers' response, the decision process often involves a live market test, where a sample from the database is tested for its response to various offers. In the case of a repeated mailing of the same, or a similar, product, the previous mailing results are used in lieu of a test to profile the next mailing audience. Confining ourselves to the case of new products only, the decision process often consists of the following steps:

1. Defining a universe of customers to consider for the mailing campaign. Depending upon the application, in some cases the universe consists of the entire customer list, in other cases only a subsection of the list.
2. Choosing a random sample from the universe for the live market test.
3. Building a response model from the test results.
4. Scoring the remaining customers in the database and ranking the customers by their likelihood of purchase.
5. Selecting the customers to roll out to, based on economic considerations (see the illustration following this list), and predicting their expected number of orders.
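The economic considerations in step 5 typically reduce to a break-even rule. Assuming, purely for illustration, a unit margin of m per order and a mailing cost of c per piece (neither figure is given in this article), the expected profit from mailing customer i with estimated response probability p_i is

\[ E[\pi_i] \;=\; p_i\,m \;-\; c , \]

so customer i is worth mailing only if \( p_i \ge c/m \); with, say, m = $40 and c = $0.60, the break-even response rate is 1.5 percent. The rule applies directly when the response model yields probabilities, as in logistic regression; NN scores, being only ordinal, require the ranking-based cutoff discussed in Section 5.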
The main difference among the various methods for targeting audiences in DBM lies in the response model. In the traditional RFM-based segmentation approach, such as judgmentally based methods (15) or the automatic tree-structured AID (16), CHAID (6), or CART (2) methods, customers' response rates are estimated by the observed response rate of their peers, that is, people like themselves who have the same purchasing pattern. In the probabilistic logistic regression approach (1), the customers' response rates are represented by their purchase probability. In the NN approach, customers' likelihood of purchase is expressed by means of an NN score which, although it cannot be interpreted as a probability or response rate, can nevertheless be used to rank customers by their likelihood of purchase: the higher the score, the higher the likelihood.
Note that the NN score is bounded by the range of the transfer function that is used to generate the output signal for any node, for example, 0 to 1 for logistic functions and -1 to +1 for hyperbolic tangent functions. But even if the NN scores are confined to the range (0, 1), they do not obey the probability rules, and thus cannot be regarded as purchase probabilities. Often, there is a time lag between the live market test and the actual mailing campaign, caused by the lead time to manufacture the product. The longer the time lag, the more complicated the decision process, because the response model estimated at the time of the test is applied against a universe drawn from the database at a different point in time which, because of the dynamics of the system, might not necessarily look like the universe the test audience was drawn from. For convenience, we will refer to the test mailing as the Test audience, to the rollout mailing as the Roll audience, to the universe the test is drawn from as the Test Universe, and the one the roll is selected from as the Roll Universe.
4. APPLYING NN TO TARGET MARKETING IN SOLO MAILINGS
NN is used in target marketing to model customers' response. In NN backpropagation models, one tries to explain the purchasing decisions in terms of a myriad of predictors reflecting the customer's purchasing history, demographics, and lifestyle. In an NN context, each predictor is represented by an input node in the input layer. In the binary yes/no case (purchase/no purchase), one node is required in the output layer to represent the output of the system. Since the expression relating the purchase decisions to the predictors is likely to be a nonlinear one, one or more hidden layers are also required to break the nonlinearity code; the number of hidden layers and the number of hidden nodes in each hidden layer are subject to experimentation. The starting network is, therefore, a fully connected three-layer network, with an input layer consisting of one node for each possible predictor, one hidden layer with several hidden nodes, and an output layer containing one output node.
Training an NN is based on a learning, or training, sample, which provides the system with examples to learn from. In our target marketing problem, the test audience is used as the learning set. Often, the test audience is too large, and a sample is drawn from the test audience to train the NN. In accordance with NN terminology, we refer to this sample as the learning sample. The purpose of the learning process in NN is to assign weights to the branches connecting any two neurons in the network, which in the case of target marketing should reflect the importance of the predictors in explaining the purchasing decisions. In the backpropagation approach, these weights are often determined to minimize the sum of squared deviations between the calculated output node values and the actual output node values. Based on the network's topology, and given the characteristics of each customer (i.e., the values of the input nodes), the set of weights emerging from the learning process is applied against the Roll Universe to calculate the value of the output node for each customer, which we refer to as the NN score. Since the larger the NN score, the better the customer, one can rank customers in decreasing order of their likelihood of purchase, thus placing the better-responding customers at the top of the list. Then, the selection process of customers for the mailing amounts to finding the threshold value that separates the targets from the nontargets.
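In code, the scoring and ranking step might look as follows; this is only a sketch, assuming a trained one-hidden-layer network of the kind sketched in Section 2, with the weight matrices, customer identifiers, and mailing depth all hypothetical.

import numpy as np

def nn_score(X, W1, b1, W2, b2):
    # Forward pass of a trained one-hidden-layer network: one NN score per customer.
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sig(sig(X @ W1 + b1) @ W2 + b2).ravel()

def rank_roll_universe(X_roll, customer_ids, weights):
    # Score the Roll Universe and rank customers in decreasing order of NN score.
    scores = nn_score(X_roll, *weights)
    order = np.argsort(-scores)                  # highest-scoring customers first
    return np.asarray(customer_ids)[order], scores[order]

# The targets are then the customers above the chosen threshold, e.g. the top half:
#   ids, scores = rank_roll_universe(X_roll, ids, (W1, b1, W2, b2))
#   targets = ids[: int(0.5 * len(ids))]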
5. ISSUES IN APPLYING NN TO TARGET MARKETING
As alluded to earlier, the application of NN to target marketing involves some unique problems that stem from the special characteristics of database marketing. Some of these issues are, indeed, typical of target marketing, but they are magnified when using neural computing to model customers' response. We discuss several of these issues in this section, making reference to several examples of NN applications discussed in the literature. The first is the well-known exclusive-or (XOR) function, which is a binary function returning the value "true" when exactly one of its inputs is true, and "false" otherwise (13).
The second is a performance comparison of different learning algorithms, including several NN algorithms, referred to as the Monk study (20), applied to an artificial robot with six attributes, assuming two to four states each, for a total of 432 input configurations. Each combination of input states is artificially assigned to a binary class (yes/no). Three types of problems were defined in the Monk study, differing in the manner in which the binary classification is defined. The third is a recent study to predict bank failure based on past observations of bank defaults (17), using a set of 19 financial predictors to discern the characteristics of banks that are likely to fail. The purpose of NN in all of these studies is to train a network that can identify the output value for any given input configuration: true/false in the XOR problem, the binary class in the Monk study, and the fail/no-fail outcome in the bank study. In addition, we make reference to a few other business applications of NN.

Stochastic Problem
We define a deterministic problem in NN as a problem in which two objects (observations, customers) with the same set of attributes yield exactly the same output, and a stochastic problem as one in which two identical inputs may yield different output results. Most NN applications reported in the literature are deterministic, for example, the XOR problem (13), the first two problems in the Monk study (20), and others. In deterministic problems, a multilayered NN is likely to converge to a stable solution. If the object space is linearly separable, that is, the factors that determine the customer class are simple linear functions of the customer's attributes, a two-layer NN, consisting of an input layer and an output layer only, is guaranteed to converge to a stable set of weights [the Rosenblatt theorem (12)]. If the system is not linearly separable, which is often the case in practical applications, then one or more hidden layers should be added to the network to break the nonlinearity code. But if the NN is well constructed, it will converge to a stable solution, even though this might require some experimentation with several network topologies to find the architecture that yields the best results. In the Monk study (20) it was shown that even if the system contains slight
noise (output nodes that were coded as zeros whereas they were supposed to be coded as ones, and vice versa), the network converges to a stable solution; but the Monk system is too small to discern whether this phenomenon extends to larger systems. In the DBM targeting case, the problem is highly stochastic, as two identical customers, with exactly the same attributes, are likely to react differently when exposed to the product, with one purchasing the product and the other declining it. These conflicts confuse the NN learning process, sending contradictory signals between the neurons. The network will thus fail to converge to a stable set of weights, and the solution will oscillate indefinitely. As a result, no unique termination condition exists for the learning process that renders the best fit for the data. Several alternatives that yield satisfactory termination rules are demonstrated by means of the numerical example below.

Large Number of Potential Predictors

In the study on the application of NN to the case of
bank failure predictions (17), the set of predictors involved 19 predefined conventional financial ratios; in another recent article on firm bankruptcy predictions, only five financial ratios were used (21); in the Monk system (20), six predictors were used to identify the robot's position. In all of these cases, and many others, the set of predictors, or attributes, is not only finite, but the explaining attributes can also be determined in advance, based on theoretical and practical grounds. But in a DBM problem, the number of possible predictors and their transformations is very large, ranging in the vicinity of several hundred(!). Because many of the predictors are based on the same purchasing history, some redundancy, as reflected by the correlation coefficients between the predictors, occurs, not to mention the noise, measurement errors, missing data, and many other problems which are prevalent when the number of potential predictors is very large. This raises the question as to which set of predictors to use in any particular case. This is the specification problem, which is the toughest problem in any large-scale regression analysis (5). While not uncommon in business applications, this problem is further compounded in DBM because the predictors explaining the purchasing decisions are likely to vary
from one product to another, requiring that the specification process be carried out anew for each product, at a substantial computational cost. The solution is, often, to conduct the specification process outside the NN environment, and then use the resulting predictors as input nodes to the NN model.
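As one illustration of such an outside screening step (a crude stand-in for a full specification model such as AMOS (7), with both thresholds chosen arbitrarily), one could rank predictors by their correlation with the response and drop near-duplicates:

import numpy as np

def screen_predictors(X, y, max_corr=0.8, keep=25):
    # Greedy filter: rank predictors by |correlation with the response|, then keep a
    # predictor only if it is not too correlated with any predictor already kept.
    n_vars = X.shape[1]
    resp_corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_vars)])
    kept = []
    for j in np.argsort(-resp_corr):             # strongest predictors first
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < max_corr for k in kept):
            kept.append(j)
        if len(kept) == keep:
            break
    return kept                                  # column indices to use as NN input nodes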
Infinite Number of States

When only a relatively small number of predictors is involved, the space of possible predictor combinations is reasonably small. For example, in the Monk study, the space of input configurations consists of 432 possible attribute combinations, or states (20). With a bounded attribute space, the learning sample is likely to contain a substantial portion of the attribute combinations (e.g., 20-30% in the Monk study), thus exposing the NN to a good number of the possible input states. This results in good learning, as the system sees more different cases. But in the DBM problem, with its large number of potential predictors, the attribute space is practically infinite, especially since many of the predictors involved are continuous. For example, even if we consider a small set of 30 binary predictors, the attribute space contains 2^30, or roughly 1,000,000,000, possible states, which is an excessively large number. With such a large object space, the learning sample, as large as it may be, contains only a minute proportion of the possible input configurations. The result could be poor learning, as the NN is exposed to very few examples, often less than one millionth of the possible number of cases.
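More generally, with k binary predictors the attribute space contains \(2^k\) states, so a learning sample of n cases covers at most a fraction

\[ \frac{n}{2^{k}}\,, \qquad \text{for example } \frac{1{,}000}{2^{30}} \approx 10^{-6} , \]

of the possible configurations (the sample size of 1,000 is an illustrative figure only); with continuous predictors the coverage is, in effect, zero.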
Low Response Rate

Clearly, the larger the number of purchases in the learning sample, the better the learning process, since with more buyers presented to the system, the NN is better equipped to discern the characteristics of the buyers and discriminate between the yes and no responses. But, unfortunately, the proportion of buyers in DBM is typically very low, often less than one percent. In comparison, the respondent rate in most other business problems is much higher. For example, Tam and Kiang (17) used 59 failed Texas banks of the 400 bankruptcy cases filed in 1991 in Texas in the learning sample, a proportion of almost 15 percent; the Monk study (20) was conducted on two problems that contain 50 percent
of ones, and a third problem containing one third of ones (a one meaning that the input configuration belongs to the class). These rates are far higher than any rates encountered in the database marketing industry. Coupled with the stochastic effects mentioned before, and the fact that the attribute space is very large, the low response rate further limits the capacity of the NN to separate the likely buyers from the likely nonbuyers. One way to handle the low response rate is to include a larger proportion of buyers in the learning sample than in the population. For example, in the bank failure study (17), the learning sample contained 50 percent failed banks (yes respondents) and 50 percent nonfailures (no respondents). But in the DBM case, with the proportion of buyers in the population being so low, artificially increasing the number of buyers relative to nonbuyers will distort the learning sample, giving heavier representation to buyers and thus resulting in more customers being classified as targets. This is one of the outcomes of the "overfitting" problem, which is also prevalent in large regression models, and which results in "too good a fit" for the training set, but a poorer fit when applied against a new set of observations.
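A sketch of such a stratified learning sample is given below; the 50/50 mix echoes the bank failure study (17), the sample sizes are arbitrary, and the resulting scores must be read with the distortion just described in mind.

import numpy as np

def balanced_learning_sample(X, y, n_per_class=500, rng=None):
    # Draw an equal number of buyers (y == 1) and nonbuyers (y == 0) from the test
    # audience, oversampling buyers relative to their true share of the population.
    if rng is None:
        rng = np.random.default_rng(0)
    buyers = np.flatnonzero(y == 1)
    nonbuyers = np.flatnonzero(y == 0)
    idx = np.concatenate([
        rng.choice(buyers, n_per_class, replace=len(buyers) < n_per_class),
        rng.choice(nonbuyers, n_per_class, replace=False),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]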
Timing Issues

In NN, examples drawn from previous observations are typically used for training, with the learning results applied to future observations for decision making. If the time gap between the past and the future is long, the underlying attribute distribution might change, with a likely effect on the accuracy of the results. In most business applications solved by NN, events occur on a continuous basis (e.g., bank failures, bankruptcies, etc.), allowing one to adjust the learning sample by bringing in new examples, and perhaps discarding old ones, in a kind of "sliding-window" scheme. This way, the NN allows for adaptive adjustment of the prediction model as new examples become available. Perhaps this is the reason for the success of NN in time-series forecasting reported in the literature (e.g., 19). Unfortunately, this is not the case in DBM, as the previous mailing and the next mailing often take place at two different points in time, as long as a year apart, thus ruling out the possibility of applying the sliding-window scheme to respond to time variations. As a result, the learning sample does not necessarily reflect the conditions prevailing at the time of the next mailing, affecting both the model selection decisions and the prediction accuracy.

Scoring the Output Nodes
In a dichotomous classification problem, such as ours, the output node takes on the value of 0 or 1: 1 if yes, 0 if no. But in the learning process, the output score will rarely, if ever, be exactly 0 or 1, and the question is how the output node should be defined, as a yes or a no. In the DBM application, this question corresponds to the targeting problem, that is, either to include the customer in the mailing (if the output node value is high) or to exclude the customer from the mailing (if the output node value is low). Tam and Kiang (17) adopted the procedure that if the output score is greater than .5, the output node is set to 1; otherwise it is set to 0. The Monk study has a different way of setting the output nodes, based on a certain combination of the input variables. However, neither method is feasible in the DBM case. Because there are so many more nonbuyers than buyers in the learning sample, the output score for a buyer is rarely greater than .5; in all likelihood, it will be closer to 0. Thus, the .5 criterion will classify almost all customers as nontargets. Of course, one could use a different threshold value, say .1, to define the output node (a score greater than .1 means target; otherwise, nontarget), and validate it retrospectively by means of a backtesting process. But the question then is, how good will this threshold be for selecting future mailings? Could we use one threshold value for all products, or should each product have a different threshold value of its own? An alternative approach for targeting audiences in an NN environment is to draw on the learning results and use the same cutoff point that attains the best fit for the learning sample as the cutoff point to pick the promotable customers from the Roll Universe. For example, if the best fit for the learning sample was obtained for the 62nd percentile, the top 62 percent of the Roll Universe is mailed. The fit measure, as a function of the audience mailed, could be expressed in terms of the percentage of buyers captured per percentage of audience mailed, or in terms of the resulting net profits involving economic considerations, or by any other appropriate measure. As is demonstrated in the numerical example below, drawing on the test results might not necessarily yield optimal solutions, but at least it has the advantage of yielding a product-specific cutoff rule.
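The cutoff-by-best-fit rule can be sketched as follows; here the fit measure is net profit, and the margin and mailing cost figures are illustrative assumptions rather than values taken from the study.

import numpy as np

def best_cutoff_percentile(scores, y, margin=40.0, cost=0.60, depths=range(5, 101, 5)):
    # Scan mailing depths (in percent of the learning sample, ranked by NN score) and
    # return the depth with the best fit, measured here as net profit.
    order = np.argsort(-scores)
    y_sorted = np.asarray(y)[order]
    best_depth, best_profit = None, -np.inf
    for d in depths:
        n_mail = int(len(y_sorted) * d / 100)
        profit = margin * y_sorted[:n_mail].sum() - cost * n_mail
        if profit > best_profit:
            best_depth, best_profit = d, profit
    return best_depth                            # e.g. 62 -> mail the top 62% of the Roll Universe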
Evaluating the Results

A problem related to the previous one is how to evaluate the classification accuracy of the NN results. When the output node can be scored as a yes
or no, one could use cost and profit criteria, or misclassification rates, to evaluate the results. For example, Tam and Kiang (17) defined the Type I and Type II classification errors, where the Type I error is defined as "the event of assigning an observation to Group 1 that should be in Group 2, while the Type II error involves assigning an observation to Group 2 that should be in Group 1," and used the expected cost of misclassification, which they referred to as the resubstitution risk, to compare the NN results against the results of several other alternative models. Again, none of these approaches can work in our problem, because there is no unique way of converting the output score into a definite yes or no.
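For completeness, the expected cost of misclassification underlying such comparisons takes the standard form (the error costs c_I and c_II would themselves have to be assumed in a mailing context):

\[ E[C] \;=\; c_{\mathrm{I}}\,P(\text{assigned to Group 1}\mid\text{Group 2})\,P(\text{Group 2}) \;+\; c_{\mathrm{II}}\,P(\text{assigned to Group 2}\mid\text{Group 1})\,P(\text{Group 1}) , \]

and it is precisely the conditional probabilities in this expression that remain undefined as long as the NN score has no agreed yes/no conversion.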
Predicting the Number of Orders

Accurate predictions of the number of orders expected to be generated by the mailing play an important role in maximizing the company's net profits and in optimizing offer type and order commitment decisions. Predictions in DBM are usually based on customers' predicted response rates. But NN scores are ordinal measures, used only to rank customers by their likelihood of purchase for the purpose of separating the targets from the nontargets. Hence, NN scores cannot be used directly to predict the number of orders of a mailing audience, unless converted to some type of response rate, or probability. Since, as discussed later, NN scores bear no theoretical meaning other than ranking customers by their likelihood of purchase, one needs to apply a heuristic approach to convert the NN scores into NN probabilities. The idea is to normalize the NN results in the learning process to equate the sum of the NN scores to the actual number of orders generated by the test audience, using either a linear or a nonlinear normalization method. But, as no one knows the statistical properties of these coalesced
probability estimates, only empirical research with realistic data can tell how reliable these estimates are, and how good a job they do in predicting the number of orders expected to be generated by the mailing campaign.
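The linear variant of this normalization, sketched below, simply rescales the scores so that they sum to the actual test orders and then sums the rescaled scores over the targeted part of the Roll audience; the order figure in the usage comment is hypothetical.

import numpy as np

def calibration_factor(test_scores, test_orders):
    # Linear normalization: scale NN scores so that, over the learning (test) audience,
    # they add up to the actual number of orders observed.
    return test_orders / test_scores.sum()

def predict_orders(roll_scores, factor):
    # Treat the rescaled scores as pseudo response probabilities and sum them over the
    # customers selected for the rollout.
    return float((factor * np.asarray(roll_scores)).sum())

# Example (hypothetical figures): if the test generated 480 orders,
#   factor = calibration_factor(test_scores, 480)
#   expected_roll_orders = predict_orders(roll_scores[targets], factor)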
Interpreting the NN Results

Finally, while not specific to DBM, the interpretation of the NN results, or the lack thereof, is especially confounding when it comes to applying NN to DBM. Other than justifying the NN results on the basis of a track record, there is no way to explain or defend the NN's decision rationale and results. Statistical models, for example, provide significance tests and performance measures to evaluate results; in expert systems one can trace the reasoning chain or invoke some sort of belief calculus to interpret the results. NN, on the other hand, suffers from low face validity, as NN decisions are supported neither by significance tests nor by deductive knowledge. As a result, one is not able to discern the more important nodes from the less important ones, or to assess the relative importance of an input node, or to explain why a certain topology works better than another, or to evaluate the credibility of the results. And with no analytic rationale for the NN results, one cannot be sure of the targeting and prediction results, or guarantee that the NN will not make a weird decision at some point in the future, or take measures to avoid such a situation from occurring. NN, therefore, lacks the accountability that is so essential to sell the idea to the decision makers, making it more difficult to implement an NN-based system and put it to work.
6. NUMERICAL EXAMPLE
To exemplify the difficulties of applying neural computing to target marketing, we consider the issue of finding the termination rule in a test-to-roll process for an actual problem involving realistic data. The termination rule is concerned with determining how many learning cycles (also called repetitions, or iterations) to conduct before terminating the learning process. Unfortunately, because of some of the problems discussed previously, no
unique criterion exists for the termination rule; yet it is a core problem in applying NN to target marketing, affecting both the quality of the results and the number of computations involved. The network used for the training process was determined based on an extensive experimentation process with alternative network topologies and search grids. The resulting network consists of one input layer, two hidden layers, one output node, and one bias node added to each of the layers. The input layer exhibits the attributes, or predictors, which explain the purchase decisions, one node per predictor; the output node represents the yes/no (purchase/no purchase) decisions. Usually, the set of initial predictors is very large, often on the order of magnitude of several hundred potential variables, and a specification process is required to select the most appropriate set of predictors with which to initiate the NN model. In this study we experimented with four sets of predictors:

1. The first, with 21 predictors, referred to as the AMOS set, was optimally selected using an automatic specification model (7) to yield a set of predictors that best explains, in a statistical sense, the customers' purchase decisions.
2. The second, with 29 variables, referred to as the FIXED set, is a prespecified, judgmentally based set of predictors which, based on previous knowledge and experience in the field, are likely to be good predictors.
3. The third, AMFIX, with 45 variables, is a union of the AMOS and the FIXED sets.
4. The last, NNFIX, with 48 predictors, is a collection of variables which, based on some initial results with the NN model, seem to be the most influential predictors.
To allow us to evaluate the results, the example was conducted in a backtesting mode, for a product that was promoted in the past, for which both the test and the roll results are known. The learning sample was drawn from the actual test audience file (TAF), and the validation sample from the actual roll audience file (RAF). The TAF and the RAF were drawn from their corresponding universes which, due to the lead time to manufacture the product, were almost a year apart. Care was taken to make sure that the two universes contained similar kinds
of people, that is, people belonging to the same RFM-based segments. Nevertheless, we note that in a backtesting environment, the TAF is not necessarily compatible with the RAF, as the test audience is a random sample of the entire universe, whereas the roll audience is an already targeted audience of the Roll Universe, excluding people that the test results indicate are not worth mailing to. To place the two audiences on the same footing for comparison purposes, we make the assumption, which is based on our extensive experience in the field, that the best 50 percent of people in the test audience are equivalent to the best 80 percent of people in the roll audience. Consequently, the learning results are evaluated at the 50th percentile of the test audience mailed, whereas the validation results are evaluated at the 80th percentile of the roll audience mailed. The performance of the NN model is measured in terms of the percentage of buyers captured as a function of the percentage of audience mailed. Clearly, the higher the percentage of buyers captured at a certain audience level, the better the fit and the better the performance of the model. Since in real life only the test results are known, we seek the number of repetitions that yields the best fit at the 50th percentile of the test audience. To evaluate the consistency of this test-based termination rule, we compared it to the corresponding termination rule that yields the best fit for the 80th percentile of the roll audience. Table 1 exhibits some interesting cases for the AMFIX predictor set, with the six cases described following the table. Each case in Table 1 corresponds to a variation of the basic network involving a different search grid. The results were recorded at multiples of 100,000 learning cycles, nine multiples in all, for a total of 900,000 iterations.
TABLE 1
Fit Results for Several Cases Involving the AMFIX Variable Set

                      Number of Repetitions (in thousands)
               100   200   300   400   500   600   700   800   900
Case A  Test   74.3  81.6  82.4  82.4  82.4  83.1  82.4  82.4  82.4
        Roll   96.2  96.3  96.0  95.9  95.9  95.9  95.7  95.7  95.7
Case B  Test   77.9  84.6  83.1  83.1  82.4  83.1  83.1  83.1  83.1
        Roll   96.4  96.0  95.8  96.0  95.9  95.9  95.9  95.9  95.9
Case C  Test   83.8  85.3  82.4  83.1  83.1  83.8  83.8  83.8  83.8
        Roll   95.8  96.3  95.1  95.3  94.8  94.8  94.7  94.7  94.7
Case D  Test   82.4  82.4  86.0  86.0  86.0  88.2  87.5  87.5  87.5
        Roll   95.8  95.0  95.0  94.7  94.7  94.8  94.7  94.7  94.7
Case E  Test   83.1  83.1  83.8  85.3  85.3  85.3  86.0  86.0  86.0
        Roll   95.9  96.0  96.0  95.3  95.2  95.2  95.0  95.0  95.0
Case F  Test   82.4  83.8  86.8  86.8  86.8  87.5  89.0  89.0  88.2
        Roll   94.8  94.8  94.8  94.6  94.6  94.7  94.8  94.8  94.8

Note. Fit results in percentages of buyers captured for the Test (50th percentile) and Roll (80th percentile) audiences.
Case A: The fit results are a unimodal function of the number of repetitions for the 50th percentile of the test, but a decreasing function for the 80th percentile of the roll.
Case B: The termination rule that yields the worst fit for the test audience is the one that yields the best fit for the roll audience.
Case C: The termination rule that yields the best fit for the test audience is also the one that yields the best fit for the roll audience.
Case D: The termination rule that yields the best fit for the test audience does not yield the best fit for the roll audience.
Case E: The fit results are an increasing function of the number of repetitions for the test audience but a decreasing function for the roll audience.
Case F: Both the worst and the best termination rules for the test audience yield the best fit for the roll audience.

Clearly, from a practical point of view, one would have liked all cases to behave like Case C, where the termination rule that yields the best fit for the test is also the one that yields the best fit for the roll. But, as the preceding examples indicate, this is not necessarily the case. On the contrary, sometimes the worst termination rule in the test audience is the best for the roll audience (Case B), and vice versa, the best termination rule in the test audience yields a relatively inferior fit when applied to the roll audience (Case D). Consequently, it appears that there is no universal termination rule that is
optimal in the sense that it yields the best fit when applied to both the test audience and the roll audience. To reconcile the differences in the fit results between the test and the roll, one could try to find a compromise solution using, say, a regret matrix approach. We define the regret associated with a termination rule in terms of the percentage of additional buyers that we could have captured at the 80th percentile of the roll audience had we known the optimal termination rule that yields the best fit. For example, if for 100,000 repetitions one captures 95.5 percent of the buyers at the 80th percentile of the roll audience, whereas the optimal fit for the problem occurs at 600,000 iterations, capturing 96.2 percent of the buyers, the regret value associated with the 100,000-iteration rule is 96.2 percent minus 95.5 percent, or 0.7 percent of the buyers. Table 2 presents the regret matrix for all networks involving the AMFIX variable set, with each observation corresponding to a given variation of the network; Table 3 summarizes the average regret values over all networks for the four sets of predictors used in the study. Columns 2 and 3 in Tables 2 and 3 provide the average fit results, expressed in terms of the percentage of buyers captured, for the Test and the Roll, respectively; the fourth column, entitled Max Fit Test, specifies the roll regret values which correspond to the best number of repetitions in the Test. Clearly, a regret value of zero in this column means that the optimal termination rule for the Test also yields the best fit for the Roll.
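Computationally, the regret entries are mechanical; for a single network variation the calculation can be sketched as follows (the fit figures in the last line repeat the worked example above and are not drawn from Table 2).

def regret_row(roll_fit_by_reps):
    # roll_fit_by_reps maps a termination rule (number of repetitions) to the percentage
    # of buyers captured at the 80th percentile of the roll audience; the regret of each
    # rule is its shortfall from the best fit achievable for that network variation.
    best = max(roll_fit_by_reps.values())
    return {reps: round(best - fit, 1) for reps, fit in roll_fit_by_reps.items()}

# 95.5% captured at 100,000 repetitions versus an optimum of 96.2% at 600,000 repetitions:
print(regret_row({100_000: 95.5, 600_000: 96.2}))    # {100000: 0.7, 600000: 0.0}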
TABLE 2
Regret Matrix for the AMFIX Variable Set

              Average Fit*     Max Fit    Number of Roll Repetitions (in thousands)
Observation   Test    Roll     Test       100   200   300   400   500   600   700   800   900
 1            81.7    95.8     0.1        1.6   0     0     0.4   0.5   0.4   0.1   0.1   0.1
 2            81.5    95.9     0.6        0.1   0     0.2   0.4   0.4   0.4   0.6   0.6   0.6
 3            82.6    96.0     0.5        0     0.4   0.6   0.4   0.5   0.5   0.5   0.5   0.5
 4            85.9    94.9     1.1        0     0.8   0.8   1.1   1.1   1.0   1.1   1.1   1.1
 5            83.8    95.6     0.4        0     0.2   0.6   0.4   0.4   0.7   0.8   0.8   0.8
 6            84.9    95.4     1.1        0.1   0     0     0.7   0.8   0.8   1.1   1.1   1.1
 7            86.7    94.8     0          0     0     0     0.2   0.2   0.1   0     0     0
 8            83.7    95.6     0.1        0.4   0     1.2   0.7   0.5   0.7   0.7   0.8   1.0
 9            84.6    94.6     1.4        0     0.8   1.4   1.7   1.6   1.7   1.4   1.4   1.4
10            80.6    96.1     0.4        0     0.4   0.4   0.6   0.6   0.6   0.5   0.4   0.4
11            81.9    96.0     0          0.1   0.4   0     0     0.4   0     0.1   0.1   0.1
12            81.9    96.1     0.6        0     0.6   1.2   0.5   0.5   0.5   0.5   0.6   0.6
13            82.8    96.0     0.2        0.5   0.2   0     1.0   1.0   0.1   0.1   0.2   0.2
14            84.7    95.4     0.8        0.5   0     0.6   1.2   1.2   1.0   0.8   0.8   0.8
15            83.3    95.5     1.3        0     0.1   0.8   0.5   0.7   0.7   1.0   1.1   1.3
16            83.7    95.4     1.2        1.0   0.5   0.8   0     1.3   1.3   1.4   1.2   1.2
17            84.2    94.8     0.5        0.2   0.1   0     0.5   0.2   0.7   0.6   0.5   0.5
18            83.7    95.1     1.2        0.5   0     1.2   1.0   1.4   1.6   1.6   1.6   1.6
Average       83.5    95.5     0.7        0.3   0.3   0.6   0.6   0.7   0.7   0.7   0.7   0.7

*Average fit results in percentages of buyers captured for the Test (50th percentile) and Roll (80th percentile) audiences.
TABLE 3
Summary of Average Regret Values by Sets of Predictors

               Average Fit*     Max Fit    Number of Roll Repetitions (in thousands)
Variable Set   Test    Roll     Test       100   200   300   400   500   600   700   800   900
AMFIX          83.5    95.5     0.7        0.3   0.3   0.6   0.6   0.7   0.7   0.7   0.7   0.7
AMOS           80.2    96.2     0.4        0.4   0.2   0.4   0.4   0.3   0.3   0.3   0.3   0.3
FIXED          77.6    95.6     0.5        0.4   0.3   0.5   0.4   0.4   0.5   0.6   0.6   0.6
NNFIX          79.5    94.4     1.5        0.4   0.3   0.9   1.1   1.2   1.4   1.6   1.6   1.6
Grand Average  80.2    95.5     0.8        0.4   0.3   0.6   0.6   0.7   0.7   0.8   0.8   0.8

*Average fit results in percentages of buyers captured for the Test (50th percentile) and Roll (80th percentile) audiences.
The results are astonishing! They indicate that the intuitively appealing termination rule that seeks the best fit for the test audience is not necessarily the optimal one from a regret matrix point of view; on the contrary, at least for the problem addressed in this study, it appears to be the worst, as indicated by the many large, often highest, regret values in the Max Fit Test column. By Table 3, the best compromise termination rule seems to be 200,000 iterations, because it minimizes the average regrets for all the predictor sets. This conclusion was also obtained for a few more products that we experimented with.
7. CONCLUSIONS
In this article we have discussed several unique problems encountered when applying NN to target marketing, unparalleled in other business applications of NN, which may affect both the selection and the prediction decisions. As a result, the application of NN to target marketing is not direct, and definitely not trivial, and may require a good degree of experimentation and trial-and-error analysis. These problems were exemplified by seeking the termination rule for a backpropagation NN model, applied to a realistic solo mailing problem, for which it was shown that no unique termination rule actually exists that yields fit results consistent for both the test and the roll audiences, requiring one to seek a compromise solution instead. The main advantage of using NN for target marketing over, say, regression-based statistical methods, is that it is a straightforward approach, requiring no a priori model formulation or assumptions. One can, therefore, use NN as a first attempt at modeling response in order to get a feel for the promotable customers to look for. Furthermore, as we discuss in a subsequent paper (8), the NN results seem to be robust across various network configurations, and by averaging the results over several starting points, one can minimize the risk of attaining a poor fit due to accidentally selecting a wrong network. This makes NN a viable alternative to statistical models in modeling response. Another possibility, also discussed in (8), is to use NN alongside a regression-based model, thus utilizing the benefits of either model to tighten the mailing even further.
We also note that the discussion in this article was confined to backpropagation NN models only, as applied to the targeting problem in solo mailings. But there might be other target marketing problems that are more amenable to NN computing, or other NN models that are easier to apply than the backpropagation models discussed here. NN is a new field, and the use of NN in direct marketing is even newer, and further research is required, empirical as well as theoretical, to explore the feasibility of using NN for target marketing. It is still too early to predict either success or failure for the use of NN in direct marketing.
REFERENCES

1. Ben-Akiva, M., and Lerman, S. R. (1985), Discrete Choice Analysis, Cambridge, MA: MIT Press.
2. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth.
3. Dutta, S., and Shekhar, S. (1988), "Bond Rating: A Non-Conservative Application of Neural Networks," Proceedings of the IEEE International Conference on Neural Networks.
4. Gallant, S. (1988), "Connectionist Expert Systems," Communications of the ACM.
5. Judge, G. G., Griffiths, W. E., Hill, R. Carter, Lutkepohl, H., and Lee, T. (1980), The Theory and Practice of Econometrics, New York: John Wiley and Sons.
6. Kass, G. (1983), "An Exploratory Technique for Investigating Large Quantities of Categorical Data," Applied Statistics, 29.
7. Levin, N., and Zahavi, J. (1992), "AMOS-Automatic Model Specifications for Binary Choice Models," Working Paper No. 25/92, Faculty of Management, Tel Aviv University.
8. Levin, N., and Zahavi, J. (1994), "Applying Neural Computing to Target Marketing," Working Paper No. 22/94, Faculty of Management, Tel Aviv University.
9. Nygard, K., Juell, P., and Kadaba, N. (1990), "Neural Networks for Selecting Vehicle Routing Heuristics," ORSA Journal on Computing, 2.
10. Roberts, M. L., and Berger, P. D. (1989), Direct Marketing Management, Englewood Cliffs, NJ: Prentice Hall.
11. Rochester, J. (Ed.) (1990), New Business Uses for Neurocomputing.
12. Rosenblatt, F. (1962), Principles of Neurodynamics, New York: Spartan Books.
13. Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (Eds.) (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Cambridge, MA: MIT Press.
14. Rumelhart, D. E., McClelland, J. L., and Williams, R. J. (1986), "Learning Internal Representations by Error Propagation," in J. L. McClelland, D. E. Rumelhart, and the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Cambridge, MA: MIT Press.
15. Shepard, D. (1990), The New Direct Marketing, Homewood, IL: Dow Jones-Irwin.
16. Sonquist, J., Baker, E., and Morgan, J. N. (1971), "Searching for Structure," Survey Research Center, University of Michigan, Ann Arbor.
17. Tam, K. Y., and Kiang, M. Y. (1992), "Managerial Applications of Neural Networks: The Case of Bank Failure Predictions," Management Science, 38.
18. Tanaka, T., Canfield, J., Oyanagi, S., and Genchi, H. (1989), "Optimal Task Assignment Using Neural Networks," Washington, DC: International Joint Conference on Neural Networks.
19. Tang, Z., de Almeida, C., and Fishwick, P. (1991), "Time Series Forecasting Using Neural Networks vs. Box-Jenkins Methodology," Simulation, 57.
20. Thrun, S. B., et al. (1991), The Monk's Problems: A Performance Comparison of Different Learning Algorithms, Carnegie Mellon University, Pittsburgh, PA.
21. Wilson, R. L., and Sharda, R. (1993), "Bankruptcy Prediction Using Neural Networks," Decision Support Systems.