Improving direct mail targeting through customer response modeling


Accepted Manuscript

Improving direct mail targeting through customer response modelling

Kristof Coussement, Paul Harrigan, Dries F. Benoit

PII: S0957-4174(15)00456-X
DOI: 10.1016/j.eswa.2015.06.054
Reference: ESWA 10131

To appear in: Expert Systems With Applications

Received date: 20 May 2014
Revised date: 3 April 2015
Accepted date: 30 June 2015

Please cite this article as: Kristof Coussement, Paul Harrigan, Dries F. Benoit, Improving direct mail targeting through customer response modelling, Expert Systems With Applications (2015), doi: 10.1016/j.eswa.2015.06.054

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

HIGHLIGHTS


• Data-mining algorithms (CHAID, CART and neural networks) perform best.

• Logistic regression and linear discriminant analysis are statistical alternatives.

• Quadratic discriminant analysis, naïve Bayes, C4.5 and the k-NN algorithm perform badly.


IMPROVING DIRECT MAIL TARGETING THROUGH CUSTOMER RESPONSE MODELLING

Kristof Coussement (corresponding author)
IESEG School of Management – Université Catholique de Lille (LEM, UMR CNRS 9221), Department of Marketing, 3 Rue de la Digue, F-59000, Lille (France). [email protected], +33 320545892

Paul Harrigan
The University of Western Australia - UWA Business School, M263, 35 Stirling Highway, Crawley, WA 6009. [email protected], +61 8 6488 1979

Dries F. Benoit
Faculty of Economics and Business Administration, Ghent University, Tweekerkenstraat 2, B-9000, Ghent (Belgium). [email protected], +32 9 264 3552


ABSTRACT

Direct marketing is an important tool in the promotion mix of companies, amongst which direct mailing is crucial. One approach to improve direct mail targeting is response modelling, i.e. a predictive modelling approach that assigns future response probabilities to customers based on their history with the company. The contributions to the response modelling literature are three-fold. First, we introduce well-known statistical and data-mining classification techniques (logistic regression, linear and quadratic discriminant analysis, naïve Bayes, neural networks, decision trees, including CHAID, CART and C4.5, and the k-NN algorithm) to the direct marketing community. Second, we run a predictive benchmarking study using the above classifiers on four real-life direct marketing datasets. The 10-fold cross-validated area under the receiver operating characteristics curve is used as the evaluation metric. Third, we give managerial insights that facilitate the classifier choice based on the trade-off between interpretability and predictive performance of the classifier. The findings of the benchmark study show that data-mining algorithms (CHAID, CART and neural networks) perform well on this test bed, followed by simpler statistical classifiers like logistic regression and linear discriminant analysis. Quadratic discriminant analysis, naïve Bayes, C4.5 and the k-NN algorithm yield poor performance.

KEYWORDS

Direct marketing, Direct mail, Response modelling, Database marketing


1. INTRODUCTION

The move from mass-marketing to mass-customization is no better reflected than in the area of direct marketing, and in particular direct mail. Marketers no longer distribute their messages to a mass market, nor do they distribute based on basic demographic characteristics; rather, they distribute and optimize different messages to different segments that are developed based on past behavior (Rowe 1989; Jonker et al. 2004; Wierich et al. 2014). Still, the need to improve the effectiveness of direct mail campaigns is a persistent issue in many industries (Guido et al. 2011; Mahdiloo et al. 2014).

Before sending direct mail, a key dilemma for marketers is which customers to target. In an effort to answer this question, marketers tend to use response modelling. Response modelling identifies customers that are likely to respond better to the marketing campaign based on their past response behavior.

ED

The above perfectly fits in the philosophy underpinning one-to-one marketing communications seen in the customer relationship management (CRM) domain (Mahdiloo et al. 2014). CRM is a strategic approach to marketing underpinned by relationship marketing theory (Morgan and Hunt 1994), which has been defined as ―a comprehensive strategy and process that enables an organization to identify, acquire,

PT

retain and nurture profitable customers by building and maintaining long-term relationships with them‖ (Sin et al. 2005, p. 1266). At the heart of CRM is data on customers. The increasing power of CRM technologies enables more and more sophisticated data collection, storage and

CE

analysis techniques. The ability to draw powerful analyses from customer data makes CRM – and thus response modelling – a critical success

AC

factor in today‘s rapidly changing environment (Danaher and Rossiter 2011, Kumar 2008, Ngai et al. 2009).


The focus of this paper is customer response modelling. The contributions of our research study are three-fold. First, we introduce the most popular response modelling methods to the direct marketing community. In particular, we review a range of popular classification algorithms borrowed from the statistical and data-mining communities (logistic regression, linear and quadratic discriminant analysis, naïve Bayes, neural networks, decision trees (CHAID, CART and C4.5) and the k-NN algorithm). Second, we complement the existing response modelling literature by integrating and contrasting all classification algorithms in a framework that benchmarks their predictive capabilities in discriminating responders from non-responders in four real-life direct mail companies. Third, managerial insights on classifier choice are given to the direct marketing community, taking into account the comprehensibility and predictive performance of the response model.

This paper is structured as follows. The next section introduces the direct marketing field and its links with customer response modelling. Following that, the range of classification algorithms is introduced and explained. We then describe the evaluation metric used, and further explain the characteristics of the datasets and the experimental setting. Finally, we present the results and their implications.

2. DIRECT MARKETING & RESPONSE MODELLING

Direct marketing is defined as the 'interactive system of marketing which uses one or more advertising media to affect a measurable response and/or transaction at any location' (Direct Marketing Association 2009). Direct marketing is big business. It is projected that direct marketing expenditures in the US will grow to $196 billion in 2016, with direct mail forming part of this growth. Direct mail is targeted at customers that are most likely to be enticed by particular offers, as opposed to a traditional mass marketing approach whose promotional activities are addressed to customers and prospects indistinctly (Guido et al. 2011; Mahdiloo et al. 2014; Risselada et al. 2014). Direct mail is not being killed off by the Internet; rather, it is being used as a complementary channel (Danaher and Rossiter 2011). Winterberry Group confirms that direct mail is still on the rise (Conlon 2015). In 2014, direct mail spending in the United States grew by 2.7%, compared to the projected 1.1% growth. Moreover, the market analysts project 1% growth in direct mail spending for 2015, equivalent to $45.7 billion of the $156.8 billion total direct and digital spending projection for 2015. Winterberry Group reasons that direct mail costs will stay steady, so the projected 1% growth is expected to come from volume increases.

Continued growth will be predicated upon the levels of return on investment of direct mail campaigns, which significantly depend on marketers being able to use specialized targeting techniques to come up with the right set of customers to contact (Lamb et al. 1994). Thus, knowing which customers are more likely to respond to a certain mailing is of paramount importance to marketers. Determining or predicting those customers who have a high probability of responding to a specific mailing based on their past behavior is called customer response modelling (Bose and Chen 2009; Mahdiloo et al. 2014). Response modelling is part of the classification literature stream. Classification is the procedure whereby customers are predicted to belong to predefined groups or target classes based on their historical customer information (Blattberg et al. 2008). Typically, a response model is estimated on a training set in which both the independent variables, describing and profiling a particular customer, and the dependent (response) variable, whether the customer responded to a certain mailing, are observed. Then, the model estimated on the training data is applied to a new set of customers that were not used during training (the test set). The result is a response probability for each customer in the test set, dependent on his or her past behavior. Managerially speaking, depending on the direct mail campaign budget, the company is able to target the top x% of customers with the highest response probability given by the response model.
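To make this targeting step concrete, the short Python sketch below ranks customers by their predicted response probabilities and keeps the top x%. The probabilities, the 30% budget and the use of NumPy are our illustrative assumptions, not part of the original study.

    import numpy as np

    # Illustrative response probabilities produced by some response model
    response_prob = np.array([0.12, 0.81, 0.05, 0.43, 0.67, 0.25])
    budget_fraction = 0.30  # campaign budget allows mailing the top 30%

    n_target = int(np.ceil(budget_fraction * len(response_prob)))
    target_idx = np.argsort(response_prob)[::-1][:n_target]  # highest probabilities first
    print("Customers to mail:", target_idx)                  # -> [1 4]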


The next section of this paper will introduce and describe the range of statistical and data-mining algorithms that can be used in customer response modelling.


3. CLASSIFICATION ALGORITHMS

The essence of one-to-one marketing communication is providing the right customers with marketing messages that they can easily act on (Ryals 2005). This means that 'prediction and targeting are both key to decision making underlying direct marketing campaigns' (Zahavi and Levin 1997, p. 35). Therefore, understanding which techniques yield the best predictive capabilities is vital for direct marketers (Rada 2005; Bose and Chen 2009). With increased efficiency and effectiveness, marketers could reduce mailing costs (Barwise and Farley 2005), increase conversion rates (Kaefer et al. 2005), and increase customer retention (Watjatrakul and Drennan 2005).

Our literature review reveals that existing literature utilizes several statistical and data-mining classification algorithms in various research setups to separate responders from non-responders. We complement this academic literature by presenting and integrating the most popular classifiers in one predictive benchmark study over multiple response datasets, while summarizing the managerial implications for managers. Several statistical classification methods to predict customer responses have been proposed and utilized, such as logistic regression, discriminant analysis and naïve Bayes (Baesens et al. 2002, Berger and Magliozzi 1992, Coussement et al. 2014, Cui et al. 2010, Deichmann et al. 2002, Kang et al. 2012, Lee et al. 2010). These techniques can be very powerful, but each algorithm also makes several stringent, but different, assumptions about the underlying relationship between the independent variables and the dependent variable. To counter this, more advanced data-mining algorithms have been proposed for discriminating between responders and non-responders, such as artificial neural networks (Baesens et al. 2002, Chen et al. 2011, Curry and Moutinho 1993, Zahavi and Levin 1997), decision tree-generating techniques (Buckinx et al. 2004, Chen et al. 2012, Haughton and Oulabi 1997, McCarty and Hastak 2007, Rada 2005) and k-NN learners (Govindarajan and Chandrasekaran 2010, Kang et al. 2012).

The following sections review the most popular response models by describing their functioning, and by discussing their merits and drawbacks.

3.1. Logistic Regression

Logistic regression (LOG) is a well-known and industry-standard classification technique for predicting a dichotomous dependent variable such as respond/do not respond to a mailing (Coussement et al. 2014, Suh et al. 1999). Besides applications in direct marketing, it is an often-used technique in a variety of predictive business settings like customer segmentation (McCarty and Hastak 2007), churn prediction (Neslin et al. 2006), customer choice modelling (West et al. 1997) and many others. Moreover, logistic regression has several advantages (Hosmer and Lemeshow 2000).

For a given training set with N labeled training examples {(x_i, y_i)} with i = 1, 2, ..., N, input data x_i ∈ R^n and corresponding binary target labels y_i ∈ {0,1}, logistic regression estimates the probability P(y=1|x) given by

P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-(w_0 + \mathbf{w}^{T}\mathbf{x})\right)} \qquad (1)

with x ∈ R^n an n-dimensional input vector, w the parameter vector and w_0 the intercept. The parameters w_0 and w are usually estimated using a maximum likelihood procedure (Hosmer and Lemeshow 2000).
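A minimal sketch of such a response model in Python, assuming scikit-learn and synthetic data (the variable names and the library choice are our assumptions, not part of the original study):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                         # N=1000 customers, n=5 variables
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # 1 = responded, 0 = did not

    # Maximum likelihood fit of w0 and w; predict_proba returns P(y=1|x) of equation (1)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    p_response = model.predict_proba(X)[:, 1]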


3.2. Linear and Quadratic Discriminant Analysis

Among business professionals and researchers, discriminant analysis is one of the most popular statistical techniques. It is applied in different prediction settings, including customer cohort prediction (Noble and Schewe 2003), consumer choice modelling (West et al. 1997) and, obviously, direct marketing (Bult 1993). Discriminant analysis is often used by database marketers. It is a technique that identifies variables that explain the difference among groups (i.e. responders/non-responders), and that is able to classify customers into the previously defined groups. Moreover, it is considered a valid alternative to traditional logistic regression.

Practically, discriminant analysis involves deriving combinations of independent variables that discriminate well between the a priori defined groups of responders/non-responders. The weights or discriminant coefficients indicate the importance of the independent variables, and they are estimated in such a way that the misclassification error rates are minimized, or the between-group variance relative to the within-group variance is maximized (Blattberg et al. 2008). When the decision boundary that separates the responders from the non-responders is quadratic in the input space, the technique is called quadratic discriminant analysis (QDA). In contrast, linear discriminant analysis (LDA) assumes that the decision boundary separating the responders from the non-responders is linear in the input space.


Discriminant analysis assigns an observation x to the class y_i ∈ {0,1} having the largest posterior probability p(y|x). Applying Bayes' theorem in a binary classification context results in: assign an observation x to class 1 (response) if

p(\mathbf{x} \mid y_1)\, P(y_1) > p(\mathbf{x} \mid y_0)\, P(y_0) \qquad (2)

with p(x|y_1) and p(x|y_0) the class-conditional distributions of class 1 and class 0 respectively, and P(y_1) and P(y_0) the prior probabilities of class 1 and class 0.

The densities p(x|y_1) and p(x|y_0) in equation (2) are unknown and have to be estimated from the training data. When one assumes that the distribution of the data is multivariate Gaussian, the observation x is classified in class 1 when

(\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) - (\mathbf{x} - \boldsymbol{\mu}_0)^T \Sigma_0^{-1} (\mathbf{x} - \boldsymbol{\mu}_0) < 2\left(\ln P(y_1) - \ln P(y_0)\right) + \ln|\Sigma_0| - \ln|\Sigma_1| \qquad (3)

with μ_1 and μ_0 the mean vectors of class 1 and class 0, and Σ_1 and Σ_0 the covariance matrices of class 1 and class 0. The presence of the quadratic terms x^T Σ_1^{-1} x and -x^T Σ_0^{-1} x indicates that the decision boundary is quadratic in x, and therefore this classification technique is called quadratic discriminant analysis (QDA). Linear discriminant analysis (LDA) makes the extra assumption that Σ_1 = Σ_0 = Σ, which makes the quadratic terms x^T Σ_1^{-1} x and -x^T Σ_0^{-1} x cancel, and therefore the classification rule becomes linear in x. More technical details and visual representations of linear and quadratic discriminant analysis are found in Duda et al. (2001).
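Both variants can be sketched as follows, again assuming scikit-learn; the synthetic X and y from the logistic regression sketch are regenerated here so the fragment runs on its own:

    import numpy as np
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

    lda = LinearDiscriminantAnalysis().fit(X, y)     # pooled covariance -> linear boundary
    qda = QuadraticDiscriminantAnalysis().fit(X, y)  # class covariances -> quadratic boundary
    p_lda = lda.predict_proba(X)[:, 1]               # posterior probability of response
    p_qda = qda.predict_proba(X)[:, 1]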


3.3. Naïve Bayes

The naïve Bayes (NB) classifier is a simple classifier that often performs well in practice (John and Langley 1995), and it is covered in several classification settings (Baesens et al. 2003), including marketing (Lessmann and Voss 2008). Before training the naïve Bayes model, continuous variables are discretized, because there is empirical and theoretical evidence that this is beneficial for the naïve Bayes model (Yang and Webb 2003, Hsu et al. 2000). Training a naïve Bayes classifier is equal to learning the conditional probabilities p(x_i|y) of each independent variable x_i given the dependent variable y. As all independent variables are discrete (categorical), the conditional probabilities are calculated using frequency counts. A new customer is then classified by using Bayes' rule to compute the posterior probability of each class y given the vector of observed independent variable values:

p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})} \qquad (4)

The "naïve" part of the classifier refers to the conditional independence of the independent variables from each other, given the group membership (responder/non-responder). This results in:

p(\mathbf{x} \mid y) = \prod_{i=1}^{n} p(x_i \mid y) \qquad (5)

More information on naïve Bayes and its extensions is found in Cui et al. (2006) or Friedman et al. (1997).
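A sketch of this pipeline, assuming scikit-learn: continuous variables are first discretized into ten equal-frequency bins (as in the experimental setup later in the paper), after which the classifier learns the conditional probabilities from frequency counts.

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB
    from sklearn.preprocessing import KBinsDiscretizer

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

    disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    X_binned = disc.fit_transform(X)         # ten equal-frequency bins per variable
    nb = CategoricalNB().fit(X_binned, y)    # learns p(x_i|y) via frequency counts
    p_nb = nb.predict_proba(X_binned)[:, 1]  # Bayes' rule of equations (4)-(5)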


3.4. Neural Networks

An artificial neural network (NEURAL) is defined as 'a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use' (Haykin 1999, p. 2). Artificial neural networks mimic the structure and functioning of the brain. They often produce high predictive performance and are used on a wide variety of business problems (e.g. Curry and Moutinho 1993), including direct marketing response applications (Baesens et al. 2002, Zahavi and Levin 1997).

Recent work (e.g. Kang et al. 2012) investigated the effectiveness of neural networks in direct mail campaigns. For instance, Guido et al. (2011) found that predictive performance was significantly higher in comparison with regression analysis. This study employs the multi-layer perceptron (MLP) neural network, where, statistically speaking, the artificial neural network can be seen as a non-linear, non-parametric regression model (White 1992).

The MLP neural network is structured along three different layers of neurons: (i) an input layer, (ii) a hidden neuron layer and (iii) an output layer.

INSERT FIGURE 1 OVER HERE

Each independent variable corresponds to a node in the input layer, while the dependent variable is represented in the output layer. In between the input and output layers, one or more hidden layers of neurons are introduced to capture the non-linearity of the data in the model. All the input nodes are connected to the hidden neurons, whereas every hidden neuron is connected to the output node. The output of the neural network, i.e. the dependent variable, is computed using a linear combination of the different layers in the neural network. One can use as many hidden layers as needed, but the neural network literature (Bishop 1995) shows that one hidden layer is complex enough to be a universal function approximator. Figure 1 shows a neural network with three independent variables in the input layer, one hidden layer with three neurons and one dependent variable in the output neuron. All independent variables are connected to each neuron in the hidden layer, and every hidden node is connected to the output neuron. All connections are assigned a weight value representing the strength of the connection. Training a neural network implies finding weight values for each connection so that the network best discriminates between responders and non-responders. More concretely, the dependent variable's value y is computed by the following equations (Ha et al. 2005):

z_j = \sigma\left(\sum_i w_{ji} x_i\right) \quad \text{and} \quad y = \bar{\sigma}\left(\sum_j w_j z_j\right) \qquad (6)

with x_i the i-th independent variable, z_j the j-th hidden neuron, y the output neuron, w_{ji} the weight from the i-th input node to the j-th hidden node, w_j the weight from the j-th hidden node to the output node, and σ and σ̄ the transfer functions of the hidden layer and the output layer. Possible transfer functions are the logistic function, 1/(1+exp(-x)), the hyperbolic tangent function, (exp(x)-exp(-x))/(exp(x)+exp(-x)), or the linear transfer function, x. Several possibilities exist to optimize a neural network architecture. Among the most popular ones are the back-propagation algorithm (Rumelhart et al. 1986) and the Levenberg-Marquardt algorithm (Levenberg 1944, Marquardt 1963, Bishop 1995). More information about the optimization algorithms for neural networks is found in Bishop (1995).
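The forward pass of equation (6) can be written down directly; the sketch below uses NumPy with a tanh hidden layer and a logistic output, with random, untrained weights purely for illustration:

    import numpy as np

    def mlp_forward(x, W_hidden, w_out):
        # z_j = sigma(sum_i w_ji * x_i): hidden activations, tanh transfer function
        z = np.tanh(W_hidden @ x)
        # y = sigma_bar(sum_j w_j * z_j): logistic output, readable as P(response)
        return 1.0 / (1.0 + np.exp(-(w_out @ z)))

    rng = np.random.default_rng(1)
    x = rng.normal(size=3)              # three independent variables (cf. Figure 1)
    W_hidden = rng.normal(size=(3, 3))  # weights w_ji: input i -> hidden neuron j
    w_out = rng.normal(size=3)          # weights w_j: hidden neuron j -> output
    print(mlp_forward(x, W_hidden, w_out))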

3.5. Decision Trees

Decision trees are considered one of the most popular and accessible data-mining techniques in industry. Murthy (1998) found more than 300 academic references to decision trees in a variety of settings, from statistics, decision science and engineering to management. For instance, McCarty and Hastak (2007) use a decision tree for customer segmentation, Benoit and Van den Poel (2012) for customer attrition, Morwitz and Schmittlein (1992) for sales forecasting, and Rada (2005) for survey response. A decision tree starts with the entire customer dataset, called the root node, and successively splits the data into smaller subsets or internal nodes, with the purpose of finding homogeneous groups of customers belonging to the same predefined group: responders or non-responders. Based on the splitting criteria and the splitting rules of the decision tree, three types are considered: CHAID, CART and C4.5.

CHAID is an acronym for Chi-squared Automatic Interaction Detection. Initially introduced by Kass (1980), CHAID builds non-binary decision trees, i.e. decision trees where more than two subsets or branches can be attached to a single root or node. The next best split at each step in the tree is determined based on the significance of a Chi-square test.

ED

CART stands for Classification And Regression Trees (Breiman et al. 1984, Steinberg and Colla 1998). In contrast to CHAID, CART uses binary recursive partitioning methodology, which constantly splits the parent node into two subsets or child nodes. The CART algorithm uses the

PT

GINI impurity index as the quality-of-split criterion. The GINI index measures the homogeneity in the distribution of the dependent variable. For example, in a direct marketing setting, the GINI falls to its minimum (zero) when a split perfectly classifies all customers into responders and

CE

non-responders. Let p1 (p0) the proportion of responders (non-responders) in node t with p1 = 1-p0, the GINI impurity index (GI) in a binary

AC

classification setting is defined as

GI = 1 - p_1^2 - p_0^2 \qquad (7)

The best split is the one that maximizes the decrease in GINI impurity between a node t and the partitioned child nodes t_l and t_r,

\Delta(t, s) = GI_t - p_l \, GI_{t_l} - p_r \, GI_{t_r} \qquad (8)

with Δ(t,s) the decrease in impurity using the splitting rule s, and p_l and p_r the probabilities of an observation going left and right. GI_t, GI_{t_l} and GI_{t_r} are the GINI impurities of nodes t, t_l and t_r. The tree is maximally grown and then pruned by minimizing the loss of predictive accuracy.

C4.5 is another algorithm for decision tree building, introduced by Quinlan (1993). C4.5 makes its splits based on the concept of information gain, where information gain is defined as the reduction in entropy due to splitting on an independent variable x_i. Entropy is a measure of the uncertainty associated with a random variable. In practice, at the root node of the decision tree, it is highly uncertain to which group, i.e. responders versus non-responders, a particular customer will belong. As one moves from the top of the decision tree down to the terminal nodes (i.e. the lowest levels of the decision tree), the uncertainty is reduced, because one becomes more and more certain about the customer's behavior (response versus non-response). For a given training set with N labeled training examples {(x_i, y_i)} with i = 1, 2, ..., N, input data x_i ∈ R^n and corresponding binary target labels y_i ∈ {0,1}, C4.5 induces decision trees using the concept of normalized information gain or gain ratio. Let p_1 (p_0) be the proportion of responders (non-responders) in subset S, with p_1 = 1 - p_0; the entropy of S is calculated as

Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0) \qquad (9)

and the information gain IG_{x_i}, being the reduction in entropy due to splitting on an attribute x_i, is calculated as

IG_{x_i} = Entropy(S) - \sum_{v=1}^{q} \frac{|S_v|}{|S|} \, Entropy(S_v) \qquad (10)

with v a unique value of x_i, q the total number of distinct values of x_i and S_v the subsample of S with x_i fixed to a certain value. The information gain ratio IGR_{x_i} is defined as

IGR_{x_i} = \frac{IG_{x_i}}{-\sum_{v=1}^{q} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}} \qquad (11)

C4.5 favors splits with the largest gain ratio by means of recursive partitioning.
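The two splitting criteria are easy to compute; a small Python sketch of equations (7) and (9), our own illustration:

    import numpy as np

    def gini(p1):
        # GINI impurity of equation (7), with p0 = 1 - p1
        return 1.0 - p1**2 - (1.0 - p1)**2

    def entropy(p1):
        # Entropy of equation (9); the 0*log2(0) terms are treated as 0
        p = np.array([p1, 1.0 - p1])
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    print(gini(1.0), entropy(1.0))  # pure node (all responders) -> 0.0 0.0
    print(gini(0.5), entropy(0.5))  # maximal uncertainty        -> 0.5 1.0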

Pruning is an essential step in decision tree construction. Pruning is defined as finding the right size of the tree, by stopping the splitting of the tree (CHAID) or by post-cutting the tree (CART, C4.5), in order to preserve a stable performance on a hold-out test set. Several advanced extensions of decision trees have recently been proposed (e.g. C4.5 rules (Quinlan 1993) or random forests (Breiman 2001)). More details on CHAID, CART and C4.5 are found in Blattberg et al. (2008).

3.6. k-Nearest Neighbor Algorithm

The k-nearest neighbor or k-NN classifier (KNN) is a very intuitive classification technique that has already been applied in a number of industries, such as credit scoring (Baesens et al. 2003) and indirect lending (Sinha and Zhao 2008). The classification of a customer is achieved by finding the k most similar customers or neighbors in the training set using the available independent variables (Hastie et al. 2001, Govindarajan and Chandrasekaran 2010). Among these k nearest neighbors, the response behavior is explored. The new customer is assigned a predefined group by a majority vote of its k nearest neighbors in the training set. The neighbors are identified using the Euclidean distance similarity measure:

d(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i - \mathbf{x}_j\| = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j)} \qquad (12)

where x_i, x_j ∈ R^n are the independent variable vectors of customers i and j. More details on k-NN estimation are found in Hastie et al. (2001), while more advanced nearest-neighbor variations can be found in the literature (e.g. Aha et al. 1991).
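A sketch with scikit-learn, using the two neighborhood sizes evaluated later in the benchmark; the data are synthetic placeholders and the library choice is our assumption:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

    knn10 = KNeighborsClassifier(n_neighbors=10).fit(X, y)    # KNN-10
    knn100 = KNeighborsClassifier(n_neighbors=100).fit(X, y)  # KNN-100
    # Predicted response probability = share of responders among the k neighbors
    p_knn = knn100.predict_proba(X)[:, 1]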

4. EVALUATION METRIC

Hanley and McNeil (1982) state that the performance of binary classification methods can be assessed by calculating and comparing their areas under the receiver operating characteristics curve (AUC). Several authors, like Provost et al. (2000) and Langley (2000), argue that the AUC is an objective performance criterion amongst all available evaluation metrics for the comparison of classifier performance. Moreover, it is often used to evaluate response models (Baesens et al. 2002).

Figure 2 gives a representation of the components of the AUC concept. Panel a represents the confusion matrix, while Panel b gives a visual representation of a receiver operating characteristics curve.

INSERT FIGURE 2 OVER HERE


The confusion matrix is a visualization tool often used in the context of binary classification (see Figure 2 – Panel a). In the columns, one finds the actual customer behavior (i.e. responded or not responded), while the rows represent the predicted customer behavior obtained by setting a given threshold on the response probabilities output by the classification algorithm. The customers with a higher (lower) response probability than the threshold are classified as (non-)responders. If TP (True Positives) is the number of responders that are correctly identified, FP (False Positives) is the number of non-responders that are classified as responders, FN (False Negatives) is the number of responders that are identified as non-responders, and TN (True Negatives) is the number of non-responders that are classified as non-responders, then

• the true positive rate (TPR) or sensitivity (TP/(TP+FN)) is the proportion of actual responders that are predicted to be responders among all actual responders;

• the false positive rate (FPR) or (1-specificity) (FP/(TN+FP)) is the proportion of non-responders that are classified as responders among all actual non-responders.
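These two rates follow directly from the confusion-matrix counts; a small sketch (our illustration, not part of the original study):

    import numpy as np

    def tpr_fpr(y_true, p_response, threshold):
        # Classify as responder when the response probability exceeds the threshold
        y_pred = (p_response >= threshold).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        return tp / (tp + fn), fp / (fp + tn)  # sensitivity, 1 - specificity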

It is clear that, by changing the classification threshold on the response probabilities, the TPR and the FPR will change. The receiver operating characteristics (ROC) curve is a graphical plot of the FPR on the X-axis and the TPR on the Y-axis, taking into account not one threshold but all possible thresholds on the response probabilities (see Figure 2 – Panel b). The area under the receiver operating characteristics curve is used as a measure of prediction performance. In sum, the AUC compares the predicted response behavior with the real response behavior of the customers, considering all possible thresholds on the response probabilities. A random classifier has an AUC of 0.5 (see the dashed line in Figure 2 – Panel b), while a perfect classification model is represented by the (0,1) point in the ROC graph (see Figure 2 – Panel b), having 100% sensitivity (no false negatives) and 100% specificity (no false positives). The latter results in an AUC of 1.
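In practice, the curve and the area beneath it need not be computed by hand; assuming scikit-learn, a sketch with synthetic labels and scores reads:

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)         # actual response behavior
    scores = rng.normal(loc=y.astype(float))  # imperfect model scores

    fpr, tpr, thresholds = roc_curve(y, scores)  # sweeps all possible thresholds
    print("AUC =", roc_auc_score(y, scores))     # 0.5 = random, 1.0 = perfect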

Intuitively, the AUC is the estimated probability that a randomly chosen instance of class 1 (response) has a higher posterior probability than a randomly selected instance of class 0 (non-response). For instance, suppose that a response model has an AUC of 0.70; this means that if you randomly pick an actual responder and a non-responder from the dataset, 70% of the time the responder will have a higher response probability output by the classifier than the non-responder.

5. EMPIRICAL SETTING

5.1. Datasets


The classification performance of the different algorithms is assessed using four real-life direct mail datasets provided by Marketing Edge (formerly known as the Direct Marketing Educational Foundation), and previously used in recent response modelling papers (e.g. Coussement et al. 2014, McCarty and Hastak 2007). The first dataset originates from a non-profit organization that solicits additional contributions from its previous donors. The second dataset is from a company with multiple divisions mailing catalogues to a unified customer database, while the third dataset contains response information about a specialty catalogue company that mails both full-line and seasonal catalogues to its customer base. Finally, the fourth dataset is from an upscale gift business that mails general and specialized catalogues to its customer base several times each year. The dependent variable reflects whether or not the customer responds to the direct mail sent. Furthermore, all four datasets contain similar types of customer profile information summarized in the independent variables. They contain socio-demographic information about the donors/customers, as well as transactional information such as timings, amounts, and frequencies of previous donation or purchase behavior. As seen from Table 1, the datasets are very similar in terms of number of observations (about 100,000 customers), but differ in response rates and number of variables. This makes this setting an ideal test bed for benchmarking the classification performance of the response models over a variety of datasets.


INSERT TABLE 1 OVER HERE

The purpose is to find a prediction model that strongly discriminates between responders and non-responders in a certain mailing context. All datasets contain actual response behavior, and all independent variables containing no missing information are taken into account. As a result, an objective starting set of variables is considered, because the literature (e.g. Crone et al. 2006) has shown that pre-processing the data could influence and favor one technique over another.

5.2. Experimental Setup

In order to compare the different classification approaches, this study applies a 10-fold cross-validation procedure, as suggested by Witten and Frank (2000). More generally, in a 10-fold cross-validation, the complete dataset is randomly split into 10 subsets of equal size. Iteratively, one part, i.e. the test set, is used for validating the analysis, while the remaining 9 parts are used for estimating the classification model (i.e. the training set). Measuring the performance on separate test data is necessary because one then avoids finding (statistical) relations that predict idiosyncratic characteristics of the training data that do not hold up in the real world (Blattberg et al. 2008). In sum, the cross-validated performance better reflects the true capabilities of the classification approach, as it reduces the variability of the validation results compared to a random one-shot split of the dataset into training and test sets, by building ten classification models on slightly different training sets (Malthouse 2001). Since each case in the original dataset belongs to a test set only once, each case receives one response probability during the cross-validation procedure.
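A sketch of this procedure, assuming scikit-learn and logistic regression as the response model: cross_val_predict returns exactly one out-of-fold response probability per customer, from which the cross-validated AUC follows.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold, cross_val_predict

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    p_cv = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                             cv=cv, method="predict_proba")[:, 1]
    print("Cross-validated AUC =", roc_auc_score(y, p_cv))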


This study reports the cross-validated AUC, which is the AUC over the 10 test sets. In order to assess whether the AUCs of the different classification techniques are significantly different on a dataset, the non-parametric test of DeLong et al. (1988) is used. In addition, logistic regression and linear and quadratic discriminant analysis employ a stepwise variable selection procedure with a significance level of 0.05. Variable selection is the process of choosing a subset of the original variables by eliminating some variables based on their predictive performance. Kim et al. (2006) state that variable-selection techniques save computational time and cost by extracting as much information as possible with the smallest number of variables, while making the classification model generalize better. Furthermore, the naïve Bayes response model needs discretized independent variables. Therefore, we discretize the continuous variables using the equal-frequency method, which divides the sorted values of a continuous variable into ten bins so that each bin represents approximately the same number of adjacent values (Yang and Webb 2003). Moreover, we implement multiple neural network classifiers. They all contain one hidden layer and are trained using a varying number of hidden neurons, from 1 to 5. The one with the best training set performance is selected for test set evaluation. The input transfer function is set to the often-used hyperbolic tangent function, while the output transfer function is set to the logistic function, since the output is then restricted to [0,1] and can be interpreted as the posterior probability of response (Bishop 1995). Further, CHAID's pruning threshold is set to 0.25, while CART and C4.5 post-prune the trained decision trees using the maximum performance objective criterion (Quinlan 1993, Breiman et al. 1984, Steinberg and Colla 1998). Finally, we use 10 (KNN-10) and 100 (KNN-100) nearest neighbors for the k-NN algorithm.

6. RESULTS


Table 2 contains the raw cross-validated AUC values per algorithm per dataset. A special visual notation is used to indicate which algorithms are (not) significantly different from the best performing algorithm on the dataset (Baesens et al. 2003). Per dataset, the best performing algorithm, i.e. the one with the highest AUC, is underlined and set in bold. The algorithm(s) that are not significantly different from the best algorithm at a 5% significance level are only set in bold, while the algorithms with a significantly inferior performance to the best one stay neutral (DeLong et al. 1988). Furthermore, Table 3 summarizes the findings from the four datasets together by reporting the pair-wise comparisons among the different algorithms by means of a wins/losses table. An entry X/Y in Table 3 means that the algorithm in the row has X (Y) times a better (worse) performance than the algorithm in the corresponding column. Moreover, the last column of Table 3 gives the average ranking of the different algorithms over the four datasets. More practically, a ranking of the algorithms is made per dataset based on the classification performance in Table 2. A classifier receives rank 1 when it is the best performing algorithm, while rank 10 indicates the worst performing algorithm. Afterwards, these rankings are averaged over the four datasets and summarized in the last column of Table 3. The lower the average rank value, the better the algorithm.

INSERT TABLE 2 OVER HERE

INSERT TABLE 3 OVER HERE

INSERT TABLE 4 OVER HERE


It is clear from Table 2 that the data-mining algorithms, neural networks and decision trees, yield the best performance in this direct marketing setting. More specifically, NEURAL is the best performing algorithm on two datasets, while CHAID is the best performing algorithm on the third dataset. It is observed from Table 2 that, among the decision-tree algorithms (CHAID, CART and C4.5), CHAID is the best performing. Note that only on dataset 2 is LOG the best algorithm, while LDA reports a performance that is not significantly inferior to the best performing algorithm.

It is seen from Table 3 that, among the decision-tree classifiers, CHAID and CART are superior to C4.5. The former algorithms beat C4.5 on all four benchmark datasets. Moreover, it is observed that, amongst the k-nearest neighbor variations, the one with the highest number of nearest neighbors (KNN-100) succeeds better in differentiating responders from non-responders. Furthermore, among the statistical algorithms (LOG, LDA, QDA and NB), Table 3 reveals that LOG and LDA behave competitively, while both algorithms perform better than NB and QDA. Finally, NEURAL yields performance competitive with the decision-tree algorithms CHAID and CART, while NEURAL yields a better performance when compared to C4.5 and the other non-decision-tree-based algorithms in this benchmark study.

In summary, and taking into account the average rankings in Table 4, one can conclude that the data-mining techniques CHAID, CART and NEURAL perform well in this benchmark setting. On the other hand, when considering only pure statistical approaches for discriminating between responders and non-responders, it is shown that LOG and LDA are the better performers. Classification algorithms like QDA, NB, KNN-10, KNN-100 and C4.5 yield poor performance.


7. CONCLUSIONS

For marketing managers, the pressure for direct mail campaigns to deliver high response rates is heightening (Guido et al. 2011, Zahavi and Levin 1997), which means that response modelling should become an important asset in the toolset of every direct marketer. This study contributes to the existing direct marketing literature in the following ways. First, this study introduces well-known response models originating from the statistics and data-mining literature to the direct marketing community. It motivates marketing managers to reflect on the merits and drawbacks of the various response models used in this research study before going to the implementation stage of their campaigns.

Second, marketing managers should acknowledge that the "Who should we contact?" question is very important, and could be optimized with the correct choice of response model. Table 5 details the response model choice drivers an analyst should consider in the decision process, including the spread of the classification algorithm in conventional software packages, the ease of implementation, the resolution time, and the merits (+) and drawbacks (-) of the estimation method.

INSERT TABLE 5 OVER HERE

In addition to Table 5, we conclude that data-mining techniques produce better predictions due to their ability to 'learn' to improve their performance from the experience of doing (Mitchell 1997). This is different from conventional compensatory linear models, based on statistical techniques such as logistic regression or (linear or quadratic) discriminant analysis, which are used to mimic individual choice processes (Levin and Zahavi 1998, Malthouse 1999, West et al. 1997). Despite the strengths of these statistical techniques, they are less effective in discriminating responders from non-responders, because they evaluate a narrow range of different solutions (Guido et al. 2011), while customer responses are often non-linear and non-compensatory.

Third, the trade-off between interpretability and predictive capacity is an important consideration. To illustrate this, Table 6 offers guidelines for marketing managers to facilitate their choice based on the benchmark results of this research study. It contains the different response models based on classifier type (data-mining versus statistical techniques), the classifier's output relevant for discovering variable importance, accuracy based on the average ranking as mentioned in Table 4, and the interpretability of the final classification model.

INSERT TABLE 6 OVER HERE

Indeed, data-mining algorithms often have a better performance than their statistical counterparts, while statistical classifiers are often more interpretable than the complex data-mining techniques. This paper confirms this specific trade-off between the better performing but hardly interpretable data-mining algorithms (CHAID, CART and NEURAL) on the one hand, and the statistical techniques (LOG and LDA), which give better insight into the final decision model but are often less competitive in terms of performance, on the other. This trade-off is often acknowledged by analysts in business as well as researchers in academia.

Although this study sheds light on an important topic in the direct marketing field, several valuable paths for future research can be offered. First, the move towards increasing the comprehensibility of the often better performing and more complex data-mining algorithms will open new paths for direct marketers. For instance, using rule extraction to increase the interpretability of artificial neural networks is a first step in this direction (Baesens et al. 2003). In general, the notion of interpretability by marketers is an area ripe for future research, which may require a mix of statistical expertise and marketing experience. Incorporating these two dimensions in novel response models could have a significant impact on the direct marketing domain.

Second, our benchmark study incorporates a wide variety of response models selected based on their previous use in the academic literature and their popularity amongst marketing professionals. The response models used in this research study do not exhibit an incremental implementation cost favoring one response model over the other, as all our classification algorithms are packaged in major statistical software packages. However, future research could investigate the beneficial effect of more advanced algorithms like support vector machines (Vapnik 1995), extreme learning machines (Huang et al. 2006), random forests (Breiman 2001), and various boosting variations (Freund and Schapire 1997) for the direct mail domain. The latter set of algorithms often requires long implementation and meta-parameter optimization times due to the large direct mail datasets. Moreover, these algorithms are not readily available in commercial statistical software packages. Therefore, business professionals will face budget and time constraints in implementing these algorithms, while IT compliance will often hinder their correct implementation in a day-to-day business context, because of the absence of well-tested, scalable software implementations of these algorithms.

Third, the differences in predictive performance could be viewed as unimportant by marketing managers, but it is important for them to realize that these relatively small differences often have a huge impact on bottom-line profit due to the large scale of the direct mail context. For example, it is known from previous marketing literature that minor increases in predictive performance result in significant changes in bottom-line contributions (e.g. Reichheld and Sasser 1990). This stems from the fact that every percentage point increase in predictive performance decreases the contact cost per responding customer, and thus increases the expected revenue of the direct mail campaign. A valuable path for future research would be to investigate and frame the main components of a return on investment calculation for response modelling, i.e. building a response model and scoring it on a new set of customers.

ACKNOWLEDGMENTS

The authors thank Marketing Edge (formerly known as the Direct Marketing Educational Foundation) for making the direct mail datasets available. Moreover, we would like to thank Alfredo Rocatto for freely distributing the naïve Bayes code to us, and our friendly reviewers for their useful comments on earlier versions of this research paper.


REFERENCES

Aha, D. W., Kibler, D. & Albert, M. K. (1991). Instance-based learning algorithms, Machine Learning, 6, 37-66.


Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J. & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring, Journal of the Operational Research Society, 54(6), 627-635.

Baesens, B., Viaene, S., Van den Poel, D., Vanthienen, J. & Dedene, G. (2002). Bayesian neural network learning for repeat purchase modelling in direct marketing, European Journal of Operational Research, 138, 191-211.

Barwise, P. & Farley, J. U. (2005). The state of interactive marketing in seven countries: Interactive marketing comes of age, Journal of Interactive Marketing, 19 (3), 67-80.

Benoit, D. F. & Van den Poel, D. (2012). Improving customer retention in financial services using kinship network information, Expert Systems with Applications, 39 (13), 11435-11442.

Berger, P. & Magliozzi, T. (1992). The effect of sample size and proportion of buyers in the sample on the performance of list segmentation equations generated by regression analysis, Journal of Direct Marketing, 6, 13-22.

Bishop, C. M. (1995). Neural networks for pattern recognition, Oxford University Press.

Blattberg, R. C., Kim, B. D. & Neslin, S. A. (2008). Database marketing: analyzing and managing customers, Springer.

Bose, I. & Chen, X. (2009). Quantitative models for direct marketing: A review from systems perspective, European Journal of Operational Research, 195, 1-16.


Breiman, L. (2001). Random forests, Machine Learning, 45, 5-32.

Breiman, L., Friedman, J., Olshen, R. A. & Stone, C. J. (1984). Classification and regression trees, Wadsworth International Group.

Buckinx, W., Moons, E., Van den Poel, D. & Wets, G. (2004). Customer-Adapted Coupon Targeting Using Feature Selection, Expert Systems with Applications, 26, 509-518.


Bult, J. R. (1993). Semiparametric Versus Parametric Classification Models: An Application to Direct Marketing, Journal of Marketing Research, 30, 380-390.

Chen, W. C., Hsu, C. C. & Chu, Y. C. (2012). Increasing the Effectiveness of Associative Classification in Terms of Class Imbalance by Using a Novel Pruning Algorithm, Expert Systems with Applications, 39 (17), 12841-12850.

Chen, W. C., Hsu, C. C. & Hsu, J. N. (2011). Optimal Selection of Potential Customer Range through the Union Sequential Pattern by Using a Response Model, Expert Systems with Applications, 38 (6), 7451-7461.

Conlon, G. (2015). Where Will Marketing Grow in 2015?, [online], available: http://www.dmnews.com/where-will-marketing-grow-in-2015/article/391699/

Coussement, K., Van den Bossche, F. A. M. & De Bock, K. W. (2014). Data accuracy's impact on segmentation performance: Benchmarking RFM analysis, logistic regression, and decision trees, Journal of Business Research, 67 (1), 2751-2758.

Crone, S. F., Lessmann, S. & Stahlbock, R. (2006). The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing, European Journal of Operational Research, 173, 781-800.

Cui, G., Wong, M. L. & Lui, H.-K. (2006). Machine Learning for Direct Marketing Response Models: Bayesian Networks with Evolutionary Programming, Management Science, 52, 597-612.

Cui, G., Wong, M. L. & Zhang, G. C. (2010). Bayesian Variable Selection for Binary Response Models and Direct Marketing Forecasting, Expert Systems with Applications, 37 (12), 7656-7662.


Curry, B. & Moutinho, L. (1993). Neural networks in marketing: Modelling consumer responses to advertising stimuli, European Journal of Marketing, 27 (7), 5.

Danaher, P. J. & Rossiter, J. R. (2011). Comparing perceptions of marketing communication channels, European Journal of Marketing, 45 (1/2), 6-42.

Deichmann, J., Eshghi, A., Haughton, D., Sayek, S. & Teebagy, N. (2002). Application of multiple adaptive regression splines (MARS) in direct response modeling, Journal of Interactive Marketing, 16 (4), 15-27.

DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach, Biometrics, 44, 837-845.

Direct Marketing Association (2009). The power of direct marketing: ROI, sales, expenditures, and employment in the US edition 2009-2010, DMA Publishing.

Duda, R. O., Hart, P. E. & Stork, D. G. (2001). Pattern classification, John Wiley & Sons Inc.

Freund, Y. & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55, 119-139.

Friedman, N., Geiger, D. & Goldszmidt, M. (1997). Bayesian network classifiers, Machine Learning, 29, 131-163.

Govindarajan, M. & Chandrasekaran, R. M. (2010). Evaluation of k-Nearest Neighbor classifier performance for direct marketing, Expert Systems with Applications, 37 (1), 253-258.

Guido, G., Prete, M. I., Miraglia, S. & De Mare, I. (2011). Targeting direct marketing campaigns by neural networks, Journal of Marketing Management, 27 (9-10).


Ha, K., Cho, S. & MacLachlan, D. (2005). Response models based on bagging neural networks, Journal of Interactive Marketing, 19 (1), 17-30.

Hanley, J. A. & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology, 143, 29-36.

Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag.

Haughton, D. & Oulabi, S. (1997). Direct marketing modeling with CART and CHAID, Journal of Direct Marketing, 11, 42-52.

Haykin, S. S. (1999). Neural Networks: A Comprehensive Foundation, Prentice Hall International.

Hosmer, D. W. & Lemeshow, S. (2000). Applied Logistic Regression, John Wiley & Sons Inc.

Hsu, C. N., Huang, H. J. & Wong, T. T. (2000). Why discretization works for naïve Bayesian classifiers, in Proceedings of the International Conference on Machine Learning (ICML-2000), Stanford, CA, USA.

Huang, G.-B., Zhu, Q.-Y. & Siew, C.-K. (2006). Extreme learning machine: Theory and applications, Neurocomputing, 70 (1-3), 489-501.

John, G. H. & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers, Morgan Kaufmann Publishers.

Jonker, J.-J., Piersma, N. & Van den Poel, D. (2004). Joint optimization of customer segmentation and marketing policy to maximize long-term profitability, Expert Systems with Applications, 27 (2), 159-168.

Kaefer, F., Heilman, C. M. & Ramenofsky, S. D. (2005). A neural network application to consumer classification to improve the timing of direct marketing activities, Computers & Operations Research, 32, 2595-2615.

Kang, P., Cho, S. & MacLachlan, D. L. (2012). Improved Response Modeling Based on Clustering, under-Sampling & Ensemble, Expert Systems with Applications, 39 (8), 6738-6753.

Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data, Journal of the Royal Statistical Society, Series C (Applied Statistics), 29, 119-127.


Kim, S. Y., Jung, T. S., Suh, E. H. & Hwang, H. S. (2006). Customer segmentation and strategy development based on customer lifetime value: A case study, Expert Systems with Applications, 31 (1), 101-107.

Kumar, V. (2008). Managing Customers for Profit: Strategies to Increase Profits and Build Loyalty, Pearson Education.


Lamb, C. W., Hair, J. F. & McDaniel, C. (1994). Principles of marketing, South-Western Publishing Co.

Langley, P. (2000). Crafting papers on Machine Learning, Stanford University, 1207-1216.

Lee, H. J., Shin, H., Hwang, S. S., Cho, S. & MacLachlan, D. L. (2010). Semi-Supervised Response Modeling, Journal of Interactive Marketing, 24, 42-54.

Lessmann, S. & Voss, S. (2008). Supervised classification for decision support in customer relationship management, in Bortfeldt, A., Homberger, J., Kopfer, H., Pankratz, G. & Strangmeier, G., eds., Gabler Edition Wissenschaft, 231-253.

M

Levenberg, K. (1944). A method for the solution of certain nonlinear problems in least squares, Quarterly Joumal of Applied Mathematics, 11, 164-168.

ED

Levin, N. & Zahavi, J. (1998). Continuous predictive modeling—A comparative analysis, Journal of Interactive Marketing, 12 (2), 5-22. Mahdiloo, M., Noorizadeh, A. & FarzipoorSaen, R. (2014). Optimal direct mailing modelling based on data envelopment analysis, Expert Systems, 31, 101–109.

PT

Malthouse, E. C. (1999). Ridge regression and direct marketing scoring models, Journal of Interactive Marketing, 13 (4), 10-23. Malthouse, E. C. (2001). Assessing the performance of direct marketing scoring models, Journal of Interactive Marketing, 15, 49-62.

AC

441.

CE

Marquardt, D. (1963). An Algorithm for Least-Squares Estimation of Nonlinear Parameters, SIAM Journal on Applied Mathematics, 11, 431-

CR IP T

ACCEPTED MANUSCRIPT

McCarty, J. A. & Hastak, M. (2007). Segmentation approaches in data-mining: A comparison of RFM, CHAID, and logistic regression, Journal of Business Research, 60, 656-662. Mitchell, T. M. (1997). Machine Learning, McGraw-Hill.

AN US

Morgan, R. & Hunt, S. (1994). The Commitment-Trust Theory of Relationship Marketing, Journal of Marketing, 58(3), 20-38.

Morwitz, V. G. & Schmittlein, D. (1992). Using Segmentation to Improve Sales Forecasts Based on Purchase Intent: Which "Intenders" Actually Buy?, Journal of Marketing Research, 29, 391-405.

Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey, Data Mining and Knowledge Discovery, 2, 345-389.

Neslin, S. A., Gupta, S., Kamakura, W., Lu, J. X. & Mason, C. H. (2006). Defection detection: Measuring and understanding the predictive accuracy of customer churn models, Journal of Marketing Research, 43 (2), 204-211.

Ngai, E. W. T., Xiu, L. & Chau, D. C. K. (2009). Application of data mining techniques in customer relationship management: A literature review and classification, Expert Systems with Applications, 36 (2), 2592-2602.

Noble, S. M. & Schewe, C. D. (2003). Cohort segmentation: An exploration of its validity, Journal of Business Research, 56, 979-987.

Provost, F., Fawcett, T. & Kohavi, R. (2000). The case against accuracy estimation for comparing induction algorithms, in Shavlik, J., ed., Morgan Kaufmann, 445-453.

Quinlan, J. R. (1993). C4.5: Programs for machine learning, Morgan Kaufmann Publishers.

Díaz de Rada, V. (2005). The effect of follow-up mailings on the response rate and response quality in mail surveys, Quality and Quantity, 39 (1), 1-19.


Reichheld, F. F. & Sasser, W. E. (1990). Zero defections: Quality comes to services, Harvard Business Review, 68, 105-111.

Risselada, H., Verhoef, P. & Bijmolt, T. (2014). Dynamic Effects of Social Influence and Direct Marketing on the Adoption of High-Technology Products, Journal of Marketing, 78 (2), 52-68.


Rowe, C. W. (1989). A Review of Direct Marketing and How It Can Be Applied to the Wine Industry, European Journal of Marketing, 23 (9), 5.

Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). Learning internal representations by error propagation, in Rumelhart, D. E. and McClelland, J. L., eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press.

Ryals, L. (2005). Making customer relationship management work: The measurement and profitable management of customer relationships, Journal of Marketing, 69, 252-261.


Sin, L. Y. M., Tse, A. C. B. & Yim, F. H. K. (2005). CRM conceptualization and scale development, European Journal of Marketing, 39 (11/12), 1264-1290.

Sinha, A. R. & Zhao, H. M. (2008). Incorporating domain knowledge into data mining classifiers: An application in indirect lending, Decision Support Systems, 46, 287-299.

Steinberg, D. & Colla, P. (1998). Classification and regression trees, Salford Systems.

Suh, E. H., Noh, K. C. & Suh, C. K. (1999). Customer list segmentation using the combined response model, Expert Systems with Applications, 17 (2), 89-97.

Vapnik, V. N. (1995). The nature of statistical learning theory, Springer-Verlag New York, Inc.


Watjatrakul, B. & Drennan, J. (2005). Factors affecting e-mail marketing sourcing decisions: A transaction cost perspective, Journal of Marketing Management, 21 (7), 701-723.


West, P. M., Brockett, P. L. & Golden, L. L. (1997). A comparative analysis of neural networks and statistical methods for predicting consumer choice, Marketing Science, 16, 370-391.

White, H. (1992). Artificial Neural Networks: Approximation and Learning Theory, Blackwell.


Wierich, R. & Zielke, S. (2014). How retailer coupons increase attitudinal loyalty - the impact of three coupon design elements, European Journal of Marketing, 48 (3/4), 699-721.

Witten, I. & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations, Morgan Kaufmann Publishers.

Yang, Y. & Webb, G. I. (2003). On why discretization works for naive-Bayes classifiers, in Gedeon, T. D. and Fung, L. C. C., eds., Springer-Verlag Berlin, 440-452.


Zahavi, J. & Levin, N. (1997). Applying neural computing to target marketing, Journal of Interactive Marketing, 11(4), 76-93.


Figure 1 The neural network structure with three input variables (xi), three hidden neurons (zj), one output neuron (y) and the corresponding weights (wji and wj).

[Figure: feed-forward network diagram showing the input layer (x1, x2, x3), the hidden neuron layer (z1, z2, z3) reached through the weights wji, and the output layer (y) reached through the weights wj.]

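Figure 1 describes a standard feed-forward network. As a minimal illustrative sketch (Python with NumPy; not the implementation used in this study, and with illustrative random weights and the biases omitted, as in the figure), the forward pass of such a 3-3-1 architecture can be written as:

    import numpy as np

    def sigmoid(a):
        # Logistic activation applied in the hidden and output layers
        return 1.0 / (1.0 + np.exp(-a))

    def forward(x, W_hidden, w_out):
        # x: (3,) input vector (x1, x2, x3)
        # W_hidden: (3, 3) hidden-layer weights w_ji
        # w_out: (3,) output-layer weights w_j
        z = sigmoid(W_hidden @ x)   # hidden activations z1, z2, z3
        return sigmoid(w_out @ z)   # output neuron y, a response probability

    # Arbitrary example weights, for illustration only
    rng = np.random.default_rng(0)
    print(forward(np.array([0.2, -1.0, 0.5]),
                  rng.normal(size=(3, 3)), rng.normal(size=3)))

In practice the weights are fitted from training data, for example by error backpropagation (Rumelhart, Hinton & Williams, 1986).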

Figure 2 A representation of the components of the AUC concept for response modelling. Panel a The confusion matrix with True Positives, False Positives, True Negatives and False Negatives. Panel b A representation of a receiver operating characteristics curve (dashed line = a random model with AUC 0.5).

Panel a

                                Actual behavior
                                Response               Non-response
  Predicted     Response        True Positives (TP)    False Positives (FP)
  outcome       Non-response    False Negatives (FN)   True Negatives (TN)

Panel b

[Figure: receiver operating characteristics curve plotting the True Positive Rate (TPR), or sensitivity, against the False Positive Rate (FPR), or (1-specificity), both on a 0-1 scale; the dashed diagonal marks a random model with AUC 0.5.]

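The components of Figure 2 are straightforward to reproduce on toy data. In the sketch below (Python with scikit-learn; the score vector and the 0.5 cut-off are illustrative assumptions, not values from this study), thresholding predicted response probabilities yields the confusion matrix of Panel a, while sweeping the threshold yields the ROC curve and AUC of Panel b:

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

    # Toy ground truth (1 = response) and predicted response probabilities
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_score = np.array([0.9, 0.3, 0.6, 0.8, 0.4, 0.2, 0.5, 0.7])

    # Panel a: one threshold gives one confusion matrix
    y_pred = (y_score >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

    # Panel b: sweeping the threshold traces TPR against FPR;
    # the AUC summarises the curve (0.5 = random, 1.0 = perfect ranking)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    print("AUC =", roc_auc_score(y_true, y_score))

The advantage of the AUC over simple accuracy is that it evaluates the entire ranking of customers rather than a single threshold, which matters when response rates are low (Provost, Fawcett & Kohavi, 2000).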

Table 1 Descriptive statistics of the datasets.

Dataset   Direct marketing business     Response setting                            # observations   # independent variables   % responders
1         Non-profit organization       Donated after direct mail?                  99,200           11                        27.42%
2         Catalogue company             Bought product after receiving catalogue?   96,551           142                       2.46%
3         Specialty catalogue company   Bought product after receiving catalogue?   106,284          250                       5.36%
4         Gift catalogue company        Bought product after receiving catalogue?   101,532          87                        9.56%

Table 2 Cross-validated AUC performance measures per dataset.

Dataset   LOG      LDA      QDA      NEURAL   NB       CHAID    CART     C4.5     KNN-10   KNN-100
1         0.6630   0.6685   0.6104   0.6206   0.6839   0.6733   0.6685   0.6843   0.6285   0.6693
2         0.6025   0.5822   0.6366   0.6394   0.5372   0.6420   0.6411   0.5969   0.5235   0.5665
3         0.8199   0.8150   0.7701   0.8225   0.7305   0.8622   0.8360   0.8192   0.6783   0.7503
4         0.7900   0.7959   0.7528   0.6706   0.8157   0.8007   0.7763   0.8231   0.6976   0.7635

Bold & underlined = best algorithm. Bold = algorithm whose performance is not statistically different from that of the best algorithm. Non-formatted = algorithm with statistically inferior performance to the best algorithm.
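The evaluation protocol behind Table 2 is 10-fold cross-validated AUC. The following is a minimal sketch of the mechanics (Python with scikit-learn; the synthetic, imbalanced dataset and the logistic regression model are placeholders for the proprietary direct marketing data and the ten benchmarked classifiers):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in: roughly 5% responders, mimicking a low response rate
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.95], random_state=42)

    # Stratified folds keep the responder rate stable across the 10 folds
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    aucs = cross_val_score(LogisticRegression(max_iter=1000),
                           X, y, cv=cv, scoring="roc_auc")
    print(f"10-fold CV AUC: {aucs.mean():.4f} (+/- {aucs.std():.4f})")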

Table 3 Wins/losses and average ranking table.

          LOG   LDA   QDA   NB    NEURAL   CHAID   CART   C4.5   KNN-10   KNN-100   Average ranking
LOG       x     2/2   4/0   4/0   1/3      1/3     1/3    3/1    4/0      3/1       4.25
LDA       2/2   x     4/0   4/0   1/3      1/3     1/3    2/2    4/0      3/1       4.50
QDA       0/4   0/4   x     3/1   0/4      0/4     0/4    1/3    3/1      2/2       7.75
NB        0/4   0/4   1/3   x     0/4      0/4     0/4    1/3    2/2      1/3       8.75
NEURAL    3/1   3/1   4/0   4/0   x        2/2     2/2    4/0    4/0      4/0       2.50
CHAID     3/1   3/1   4/0   4/0   2/2      x       3/1    4/0    4/0      4/0       2.25
CART      3/1   3/1   4/0   4/0   2/2      1/3     x      4/0    4/0      4/0       2.75
C4.5      1/3   2/2   3/1   3/1   0/4      0/4     0/4    x      4/0      2/2       6.25
KNN-10    0/4   0/4   1/3   2/2   0/4      0/4     0/4    0/4    x        0/4       9.25
KNN-100   1/3   1/3   2/2   3/1   0/4      0/4     0/4    2/2    4/0      x         6.75

Cell value X/Y = the algorithm in the row has X (Y) times a better (worse) performance than the algorithm in the corresponding column. Average ranking value = average ranking of the algorithm over the datasets.
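The average rankings in Tables 3 and 4 follow the usual recipe: rank the classifiers within each dataset, then average the ranks across datasets. The sketch below (Python with SciPy) illustrates the mechanics on the AUC values of Table 2; it will not reproduce the published averages exactly, because the paper's rankings also account for the statistical significance tests noted under Table 2:

    import numpy as np
    from scipy.stats import rankdata

    algorithms = ["LOG", "LDA", "QDA", "NEURAL", "NB",
                  "CHAID", "CART", "C4.5", "KNN-10", "KNN-100"]
    # Cross-validated AUCs from Table 2, one row per dataset
    auc = np.array([
        [0.6630, 0.6685, 0.6104, 0.6206, 0.6839, 0.6733, 0.6685, 0.6843, 0.6285, 0.6693],
        [0.6025, 0.5822, 0.6366, 0.6394, 0.5372, 0.6420, 0.6411, 0.5969, 0.5235, 0.5665],
        [0.8199, 0.8150, 0.7701, 0.8225, 0.7305, 0.8622, 0.8360, 0.8192, 0.6783, 0.7503],
        [0.7900, 0.7959, 0.7528, 0.6706, 0.8157, 0.8007, 0.7763, 0.8231, 0.6976, 0.7635],
    ])

    # Rank within each dataset (1 = highest AUC), then average over datasets
    ranks = rankdata(-auc, axis=1)  # negate so the best AUC receives rank 1
    for name, r in sorted(zip(algorithms, ranks.mean(axis=0)), key=lambda t: t[1]):
        print(f"{name:8s} {r:.2f}")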


Table 4 Average rank of the algorithms (with 1 = best performance and 10 = worst performance).

Algorithm   Average rank
CHAID       2
CART        2.25
NEURAL      2.5
LOG         4.5
LDA         5.5
C4.5        6.25
KNN-100     6.75
NB          8
QDA         8
KNN-10      9.25

Table 5 Response model's choice drivers.

LOG
  Spread: Wide. Implementation: Simple. Resolution time: Fast.
  Classifier estimation method (+): Robust estimation.
  Classifier estimation method (-): Unobserved latent utility follows logistic distribution; additive model.

LDA - QDA
  Spread: Wide. Implementation: Simple. Resolution time: Fast.
  Classifier estimation method (-): Normality assumption of independent variables.

NB
  Spread: Medium. Implementation: Simple. Resolution time: Fast.
  Classifier estimation method (+): Small amount of training data required.
  Classifier estimation method (-): Independence assumption of independent variables; difficult to theoretically justify discretization method.

NEURAL
  Spread: Medium. Implementation: Difficult. Resolution time: Slow.
  Classifier estimation method (+): Ability to capture complex, non-linear relationships.
  Classifier estimation method (-): Complex model parameter optimization; black box model.

CHAID - CART - C4.5
  Spread: Wide. Implementation: Simple. Resolution time: Fast.
  Classifier estimation method (+): Ability to capture complex, non-linear relationships; no assumption on variable distributions.

KNN
  Spread: Medium. Implementation: Medium. Resolution time: Slow.
  Classifier estimation method (+): Intuitively simple.
  Classifier estimation method (-): Negative impact of irrelevant independent variables.

Table 6 Managerial recommendations on the response models.

Response model   Type                                    Output                                     Interpretability   Accuracy*
LOG              STATISTICAL                             Parameter estimates                        MEDIUM             MEDIUM
LDA              STATISTICAL                             Parameter estimates                        MEDIUM             MEDIUM
QDA              STATISTICAL                             Parameter estimates                        MEDIUM             LOW
NB               STATISTICAL                             Conditional probabilities                  LOW                LOW
NEURAL           DATA MINING                             Weights of variables, nodes and biases     LOW                HIGH
CHAID            DATA MINING - Decision Trees            Decision-tree model, weight of variables   HIGH               HIGH
CART             DATA MINING - Decision Trees            Decision-tree model, weight of variables   HIGH               HIGH
C4.5             DATA MINING - Decision Trees            Decision-tree model, weight of variables   HIGH               LOW
KNN-10           DATA MINING - k-Nearest Neighbour       //                                         LOW                LOW
KNN-100          DATA MINING - k-Nearest Neighbour       //                                         LOW                LOW

* Based on the average ranks in Table 4.