Expert Systems with Applications 39 (2012) 6738–6753
Improved response modeling based on clustering, under-sampling, and ensemble

Pilsung Kang a, Sungzoon Cho b,*, Douglas L. MacLachlan c

a IT Management Programme, International Fusion School, Seoul National University of Science and Technology (Seoultech), 232 Gongneoung-ro, Nowon-gu, 139-743 Seoul, South Korea
b Department of Industrial Engineering, Seoul National University, 599 Gwanak-ro, Gwanak-gu, 151-744 Seoul, South Korea
c Department of Marketing and International Business, Foster School of Business, University of Washington, Seattle, WA 98195, USA

* Corresponding author. E-mail addresses: [email protected] (P. Kang), [email protected] (S. Cho), [email protected] (D.L. MacLachlan).

doi:10.1016/j.eswa.2011.12.028
Abstract

The purpose of response modeling for direct marketing is to identify those customers who are likely to purchase a campaigned product, based upon customers' behavioral history and other available information. In contrast to a mass marketing strategy, well-developed response models used for targeting specific customers can contribute profits to firms by not only increasing revenues but also lowering marketing costs. Endemic in the customer data used for response modeling is the class imbalance problem: the proportion of respondents is small relative to non-respondents. In this paper, we propose a novel data balancing method based on clustering, under-sampling, and ensemble to deal with the class imbalance problem and thus improve response models. Using publicly available response modeling data sets, we compared the proposed method with other data balancing methods in terms of prediction accuracy and profitability. To investigate the usability of the proposed algorithm, we also employed various prediction algorithms when building the response models. Based on the response rate and profit analyses, we found that our proposed method (1) improved the response models by increasing the response rate as well as reducing performance variation, and (2) increased total profit by significantly boosting revenue. © 2011 Elsevier Ltd. All rights reserved.

Keywords: Direct marketing; Response modeling; Class imbalance; Data balancing; CRM; Clustering; Ensemble
1. Introduction
Response modeling has become one of the most effective tools for firms seeking to sustain long-term relations with their customers (Berry & Linoff, 2004; Gönül & Hofstede, 2006; Sun, Li, & Zhou, 2006). The goal of response modeling is to identify customers who are likely to purchase a product, based on customers' purchase history and other information. Based on model predictions, firms attempt to induce high-potential buyers to purchase the campaigned product through their communication channels, e.g., phone, mailed catalog, or e-mail. A well-developed response model can contribute to business in two ways. First, it increases total revenue. The customers targeted during a marketing campaign are typically divided into two groups: one group who would buy the product anyway, whether or not they are targeted, and the other group who would not buy the product had they not been targeted. By reminding the latter group of what they might need at the right time, they may be persuaded to open their wallets. Thus, the additional sales made to those customers are the obvious contribution of the response model. Second, it lowers total marketing cost. Generally, mass advertising is extremely expensive, since a customer's average likelihood of purchase is very low. In contrast to mass marketing, the response model suggests attempting to attract only customers with a relatively high purchase likelihood. Therefore, it saves the money that would have been spent exposing customers who have little interest in buying the product to promotional messages. With increased revenue and lowered cost, firms' net profit increases (Berry & Linoff, 2004; Elsner, Krafft, & Huchzermeier, 2004; Gönül & Hofstede, 2006; Zhang & Krishnamurthi, 2004).
Past studies have shown that while increasing the response rate is not an easy task, its impact can be remarkable. For instance, Coenen, Swinnen, Vanhoof, and Wets (2000) pointed out that even a small improvement in response rate can change the total result of a direct mailing campaign from failure to success. Baesens, Viaene, Van den Poel, Vanthienen, and Dedene (2002) illustrated how a small improvement in response rate could result in huge additional profit. In their example, a mere 1% increase in response rate for an actual mail-order company yielded an additional 500,000 Euro. Knott, Hayes, and Neslin (2002) reported that for a retail bank, a mere 0.7% increase in response rate tripled total revenue and raised revenue per respondent by 20%. Sun et al. (2006) noted that improving the response rate can not only increase profit but also strengthen customer loyalty, because properly targeted customers are more likely to be satisfied and stay with the firm over the long run. Encouraged by these noticeable positive effects, a large number of studies have been conducted with the objective of increasing response rate through improving the prediction
algorithms used in response modeling. Logistic regression has been widely employed as a base model due to its simplicity and availability (Aaker, Kumar, & Day, 2001; Hosmer & Lemeshow, 1989). Besides logistic regression, stochastic RFM models (Colombo & Jiang, 1999) and hazard function models (Gönül, Kim, & Shi, 2000) were proposed in the statistics tradition, while artificial neural networks (Baesens et al., 2002; Kaefer, Heilman, & Ramenofsky, 2005), bagging artificial neural networks (Ha, Cho, & MacLachlan, 2005), Bayesian neural networks (Baesens et al., 2002), support vector machines (Shin & Cho, 2006), and decision trees (Coenen et al., 2000) were proposed by pattern recognition and data mining researchers.
The most prevalent difficulty in response modeling is the class imbalance problem. In classification tasks, class imbalance occurs when instances of one class extremely outnumber those of the other classes. Class imbalance usually degrades the performance of classification algorithms. Most classification algorithms require sufficient instances of all classes to yield stable models that provide unbiased classification. If one class greatly outnumbers the others, classification results tend to be biased toward the majority class. In customer databases used for response modeling, it is common for non-respondents to overwhelmingly outnumber respondents. For example, fewer than 10% of customers are respondents in the DMEF4 data set used in Ha et al. (2005) and Shin and Cho (2006), while only about 6% of customers are respondents in the CoIL Challenge 2000 data set (van der Putten, de Ruiter, & van Someren, 2000). To make matters worse, response rates in general direct marketing situations are often much lower. If an appropriate remedy for class imbalance is not applied, the classification algorithms employed for response modeling are likely to predict most customers as non-respondents, which leads to a high opportunity cost. For that reason, handling the class imbalance of customer data has been recognized as a critical factor for the success of direct marketing (Błaszczyński, Dembczyński, Kotłowski, & Pawłowski, 2006; Hill, Provost, & Volinsky, 2006; Lai, Wang, Ling, Shi, & Zhang, 2006; Ling & Li, 1998).
Setting response modeling aside, class imbalance is a common symptom of classification tasks in many subject areas, such as image processing (Kubat, Holte, & Matwin, 1998; Yan, Liu, Jin, & Hauptmann, 2003), remote sensing (Bruzzone & Serpico, 1997), and medical diagnosis (Lee, Cho, & Shin, 2008; Pizzi, Vivanco, & Somorjai, 2001). Therefore, a number of methods to overcome class imbalance have been proposed, which can be grouped into two categories, algorithm modification and data balancing, as shown in Fig. 1.

Fig. 1. Approaches to handling class imbalance: algorithm modification (cost differentiation, boundary alignment) and data balancing (under-sampling, over-sampling).

Methods based on algorithm modification insert an additional specialized mechanism into the original algorithm. There are
two ways to do this: (1) giving different misclassification costs to each class, or (2) shifting the decision threshold toward the minority class. For example, Wu and Chang (2003) proposed giving a larger misclassification cost to the minority class than to the majority class and modifying the kernel matrix when training support vector machines.1 Bruzzone and Serpico (1997) divided the training process of neural networks into two phases. In the first phase, the networks were trained with misclassification costs inversely proportional to the number of patterns in each class.2 In the second phase, using the weights obtained in the first phase as initial weights, the networks were trained again to minimize the mean squared error (MSE). Huang, Yang, King, and Lyu (2004) tackled class imbalance by training a biased 'Minimax' machine, in which the objective function is formulated to maximize the accuracy of minority class classification given a lower bound on majority class accuracy.
Data balancing methods build a new training data set in which all classes are well balanced, using different sampling strategies for each class. They have an advantage over algorithm modification methods in that they are universal: because data balancing methods work independently of classification algorithms, they can be combined with any classifier, while algorithm modifications work well only with the particular classifiers for which they are designed.3 Under-sampling and over-sampling are the two major strategies for data balancing.
Under-sampling reduces the number of majority class instances while keeping all the minority class instances; the proportion of minority class instances in the training data increases as a consequence. Under-sampling is effective in reducing training time, but it often distorts the class distribution because a large number of majority class instances are removed. Random sampling is the simplest way to implement under-sampling: in random under-sampling, a set of majority class instances is selected at random and combined with the minority class patterns. SHRINK (Kubat, Holte, & Matwin, 1997) and one-sided selection (OSS) (Kubat & Matwin, 1997) are other well-known under-sampling methods. Fig. 2 shows the OSS algorithm. OSS removes majority class instances identified as noise, redundant, or borderline. A 'noise' instance is surrounded by instances of the other class, while a 'redundant' instance is surrounded by instances of its own class.
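To make the random under-sampling recipe above concrete, the following is a minimal sketch; it is an illustrative implementation, not the paper's code, and the array names X_maj/X_min are assumptions.

```python
# A minimal sketch of random under-sampling: draw as many majority-class
# instances (without replacement) as there are minority-class instances.
import numpy as np

def random_under_sample(X_maj, X_min, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    X = np.vstack([X_maj[idx], X_min])                    # balanced training set
    y = np.r_[np.zeros(len(X_min)), np.ones(len(X_min))]  # 0 = majority, 1 = minority
    return X, y
```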
1 What the data mining literature denotes as 'training', the statistics literature calls fitting or estimating.
2 'Patterns' in data mining terminology correspond to vectors of observations in statistics.
3 The term 'classifier' is used in data mining to denote the model or rule used to classify entities.
Fig. 2. The algorithm of one-sided selection (OSS).
An instance is identified as 'borderline' if instances from different classes are mixed in its neighborhood. After the noise, redundant, and borderline instances are removed from the majority class, the number of remaining majority class instances becomes smaller. Although OSS removes only a small number of majority class instances, evident improvements have been reported in practice (Kubat & Matwin, 1997; Laurikkala, 2001).
Over-sampling increases the number of minority class instances while keeping all the majority class instances. Consequently, the proportion of the minority class in the training data increases, as does the total number of training instances. Over-sampling preserves the original data distribution, but it needs more time to train a classifier because it increases the total number of training instances. Random sampling is the simplest way to implement over-sampling: in random over-sampling, minority class instances are duplicated at random and combined with the majority class instances. SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) and SMOTEBoost (Chawla, Lazarevic, Hall, & Bowyer, 2003) are other well-known over-sampling methods. The SMOTE algorithm is illustrated in Fig. 3. SMOTE synthetically generates minority class instances using a nearest neighbor rule. For each minority class instance, one of its k nearest neighbors is randomly selected; then a synthetic instance is generated on the line segment between the instance and the selected neighbor. SMOTE has two user-specified parameters: the amount of duplication (D), which controls how many synthetic instances are generated, and the number of nearest neighbors (k), which controls how many candidates are considered when drawing the line segment. A larger D softens the class imbalance more, while a larger k lets the synthetic minority instances spread more widely.

Fig. 3. The algorithm of SMOTE.
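The following is a hedged sketch of SMOTE's generation step as just described; it is illustrative rather than the reference implementation, with D encoded as a duplication percentage (e.g., D = 200 yields two synthetic instances per minority instance) and a brute-force neighbor search for clarity, not efficiency.

```python
# A minimal sketch of SMOTE: for each minority instance, interpolate toward
# one of its k nearest minority-class neighbors at a random position.
import numpy as np

def smote(X_min, D=200, k=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_min)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # exclude self-matches
    neighbors = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbors per instance
    synthetic = []
    for i in range(n):
        for _ in range(D // 100):                  # D% duplication per instance
            j = rng.choice(neighbors[i])           # pick one of the k neighbors
            gap = rng.random()                     # random point on the line segment
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```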
Although data balancing methods are known to be effective, they have some inherent problems when used for response modeling. First, under-sampling carries a risk of large variation. In other words, it does not guarantee a stable response rate, because under-sampling typically discards a large amount of non-respondent data. If the sampled non-respondents represent the entire non-respondent class well, the model may achieve an accurate response rate. If, on the other hand, the sampled data do not explain all non-respondents well, the response modeling performance may not be satisfactory. Second, over-sampling may provide wrong information about the customers. Since some over-sampling techniques generate virtual respondents from a limited pool of actual respondents, the synthetically generated respondents can be unrealistic and could distort the characteristics of respondent customers.
In this paper, in order to deal with the aforementioned limitations, a new data balancing method based on clustering, under-sampling, and ensemble, named CUE, is proposed. Our method is based on under-sampling because over-sampling is an unrealistic and risky strategy. To reduce the large performance variation of under-sampling caused by sampling bias, we sample the non-respondents in a way that minimizes information loss. To do so, we first partition the non-respondents into clusters using the K-Means clustering algorithm. Then, non-respondents are sampled in proportion to the size of each cluster. By doing so, the sampled non-respondents preserve the information of all non-respondents as much as possible. To improve prediction accuracy and further reduce performance variation, we build multiple different 'balanced' training data sets by repeating the sampling process. These balanced training data sets are provided to the prediction algorithm, and the final prediction is made by combining the prediction results of the individual prediction models trained on the corresponding training data sets. This ensemble approach is expected to enhance response rate accuracy and achieve more stable performance. To verify the proposed method, we build response models using two actual customer data sets. The proposed method is compared with four well-known data balancing methods across different classification algorithms.
The rest of this paper is structured as follows. In Section 2, we describe the proposed method with brief background. In Section 3, we explain the experimental settings, including the data sets, prediction algorithms, performance measures, and benchmark balancing methods. In Section 4, we analyze the response models based on each prediction algorithm/data balancing method pair in terms of response rate and profit. In the concluding Section 5, we discuss future research possibilities.

2. Methodology

2.1. Background: clustering and ensemble
In a typical customer data set used for direct marketing (created as the result of a market test), a large number of customers are labeled as either respondents or non-respondents to the market offering. As mentioned earlier, the data set is usually extremely imbalanced in that non-respondents substantially outnumber respondents. It is reasonable to assume that the respondents have some characteristics in common, which lead them to buy the product or service promoted. It is naive, on the other hand, to consider all non-respondents as a homogeneous group (Gehrt & Shim, 1998). People have many reasons not to respond. Therefore, it is more realistic to regard the non-respondents as a mixed heterogeneous group.
Grouping a large number of customers into relatively homogeneous market segments and understanding each segment is an important task in marketing (Hwang, Jung, & Suh, 2004; Jonker, Piersma, & del Poel, 2004; Levin & Zahavi, 2001; Reutterer, Mild, Natter, & Taudes, 2006; Tsai & Chiu, 2004). Clustering is the most commonly used analysis technique for customer segmentation. Clustering is a data analysis tool that partitions an entire data set into a number of meaningful subsets or groups, so-called clusters, according to a predefined criterion (Berkin, 2006; Han & Kamber, 2006; Jain, Murty, & Flynn, 1999; Xu & Wunsch, 2005). Given a set of entities or instances $\{(\mathbf{x}_i, y_i) : i = 1, \ldots, N\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is a d-dimensional input vector, the goal of clustering is to assign the instances to a finite set of K clusters such that any pair of clusters is mutually exclusive while the union of all clusters equals the entire data set X, as follows:
$$X = C_1 \cup C_2 \cup \cdots \cup C_K, \qquad C_i \cap C_j = \emptyset, \quad i \neq j. \qquad (1)$$
A good clustering algorithm results in a cluster structure in which individual clusters are heterogeneous to each other, but instances within each cluster are homogeneous. Depending on how the clusters are structured, clustering algorithms can be divided into hierarchical, partitioning, and density-based methods (Berkin, 2006; Jain et al., 1999). An ensemble, also known in the data mining field as a committee machine, combines multiple classifiers to improve the prediction accuracy of classification algorithms (Bishop, 1995). Given a set of instances $\{(\mathbf{x}_i, y_i) : i = 1, \ldots, N\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is a d-dimensional input vector and $y_i \in \{0, 1\}$ is a target value, the prediction error can be decomposed into bias and variance as follows:
$$E[\{\hat{y}(\mathbf{x}) - y(\mathbf{x})\}^2] = \underbrace{\{E[\hat{y}(\mathbf{x})] - y(\mathbf{x})\}^2}_{(\mathrm{bias})^2} + \underbrace{E[\{\hat{y}(\mathbf{x}) - E[\hat{y}(\mathbf{x})]\}^2]}_{\mathrm{variance}}, \qquad (2)$$
where $E[\cdot]$, $y(\mathbf{x})$, and $\hat{y}(\mathbf{x})$ denote the expectation, the actual target value, and the predicted value, respectively. Bias is the error between the average of the model predictions and the actual target, while variance is the error between the individual model predictions and their average. In general, simple models, such as logistic regression, tend to have a larger bias, while complex models, such as neural networks or support vector machines (SVM), tend to have a larger variance. The purpose of an ensemble is to reduce prediction error by smoothing either bias or variance. The benefit of an ensemble is clear. Let there be L individual classifiers with prediction errors $E_i$, and assume that the ensemble prediction is made by simply averaging the predictions of all classifiers. Let $E_{Ensemble}$ be the prediction error of the ensemble and $E_{AV}$ the average prediction error of the individual models. If the models are completely independent, the prediction error of the ensemble becomes (Bishop, 1995):
$$E_{Ensemble} = \frac{1}{L} E_{AV}. \qquad (3)$$
Theoretically, the more individual models included, the lower the prediction error. In practice, because individual models are less than perfectly independent, the error reduction of an ensemble is not as dramatic as Eq. (3) suggests. However, an ensemble is still attractive because it reduces the prediction error, or at least does not increase it, since the following relation always holds (Bishop, 1995):
$$E_{Ensemble} \leq E_{AV}. \qquad (4)$$
There are several ways to build an ensemble classifier. Breiman (1996) proposed the bootstrap aggregating, or bagging, technique. Bagging constructs multiple training sets by sampling instances with replacement; prediction models are then trained on the individual training sets. Bagging is known to be very effective for reducing variance; thus it is generally used with complex classification algorithms.
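A minimal sketch of bagging follows; the decision-tree base learner and the ensemble size are illustrative choices, not the paper's configuration.

```python
# A minimal sketch of bagging (Breiman, 1996) with scikit-learn-style models.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    models, n = [], len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict_proba(models, X):
    # Average the individual models' predicted class-1 probabilities.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```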
Boosting (Freund & Schapire, 1997) is another sampling-based ensemble method. While bagging constructs multiple training data sets simultaneously, boosting constructs them step by step. Once a classification algorithm is trained on the original training set, correctly classified and misclassified instances are given different selection weights, such that currently misclassified instances have more chances to be included in the next training set. By repeating this procedure, multiple training sets are generated consecutively. Boosting makes a final prediction by aggregating the predictions of all classifiers according to combining weights, which are assigned based on the corresponding training accuracy. In contrast to bagging and boosting, both of which are based on multiple training sets, an ensemble can also be constructed from multiple classifiers on the same training set. For example, heterogeneous classification algorithms are trained on the same training set, and the prediction is made by combining the results of the classifiers (Kedarisetti, Kurgan, & Dick, 2006; Tsoumakas, Katakis, & Vlahavas, 2004). In addition, an identical classification algorithm with various model parameters, e.g., support vector machines with various kernel and cost combinations (Valentini & Dietterich, 2004), can be used to construct an ensemble. Both multiple training sets and heterogeneous classifiers can also be used simultaneously.

2.2. CUE: data balancing based on clustering, under-sampling, and ensemble

We propose a new data balancing method based on clustering, under-sampling, and ensemble, named CUE. The procedure of CUE is illustrated in Fig. 4. In Step 1, the customer data set is divided into respondents and non-respondents. In Step 2, clustering is conducted to group all non-respondents into a number of homogeneous clusters. In this paper, we adopted K-Means clustering (MacQueen, 1967), by far the most widely used clustering algorithm, as the segmentation method. It finds K clusters by minimizing the following within-cluster sum of squared errors:
$$\arg\min_{C} \sum_{i=1}^{K} \sum_{\mathbf{x}_j \in C_i} \|\mathbf{x}_j - \mathbf{c}_i\|^2, \qquad (5)$$
where $\mathbf{c}_i$ is the centroid of $C_i$ and C is the union of all clusters ($C = C_1 \cup \cdots \cup C_K$). K-Means clustering is a repetition of two alternating procedures. In the first step, all instances are assigned to their nearest centroid. In the second step, the centroid of each cluster is recomputed from the newly assigned instances. The iterations continue until a predefined stopping criterion is met. In practice, K-Means clustering has several benefits: it works well with the Euclidean distance metric (Berkin, 2006); it allows straightforward parallelization (Dhillon & Modha, 1999); and it does not depend on data ordering (Trujillo & Izquierdo, 2005). However, its final cluster structure strongly depends on the choice of initial seeds (Berkin, 2006; Han & Kamber, 2006; Jain et al., 1999; Xu & Wunsch, 2005). To overcome this problem, i.e., to make the clustering result more stable, we used the CSI algorithm proposed by Kang and Cho (2009) for seed initialization. The CSI algorithm selects the initial seeds based on three criteria: centrality, sparsity, and isotropy. It was empirically found that the CSI algorithm makes K-Means clustering results much more reliable than other seed-initialization methods.
In Step 3, multiple training data sets are constructed as follows. Each training data set contains all respondents and sampled non-respondents, such that the number of sampled non-respondents is equal to the number of respondents, so as to eliminate class imbalance. When constructing the balanced training data sets, the sampling frequency of non-respondents is proportional to the size of the cluster.
Fig. 4. The procedure of CUE. (Step 1: separate the customer data into respondents and non-respondents. Step 2: divide the non-respondents using clustering. Step 3: construct multiple training sets by combining the respondents with non-respondents sampled from each cluster. Step 4: train each prediction model with the corresponding training set. Step 5: make a prediction by aggregating the prediction results.)
Let us give a simple example. Assume that there are 100 respondents and 1000 non-respondents in a customer data set. Assume also that the optimal segmentation structure of the non-respondents is such that 500, 300, and 200 customers belong to the first, second, and third clusters, respectively. Under this circumstance, each training data set is constructed by combining all 100 respondents with 100 sampled non-respondents, in which 50, 30, and 20 non-respondents are selected from the first, second, and third clusters, respectively. By doing so, the non-respondents in each training set preserve the information of the original structure of the entire non-respondent class as much as possible. In Step 4, prediction models are trained on the individual training sets constructed in Step 3. Then, in Step 5, the final prediction is made by aggregating the prediction results of all models.
We expect CUE to have some advantages over the other data balancing methods. First, it reduces sampling bias by preserving the original structure of the non-respondents as much as possible. Sampling bias often leads to unstable prediction performance. A prediction model with large variation is not desirable for direct marketing, because a small change in prediction accuracy can
change the entire campaign result. In contrast to random sampling-based methods, CUE can achieve a more stable prediction performance that decision makers can trust. Second, by adopting an ensemble scheme, CUE is expected to improve prediction performance. As illustrated above, an ensemble at least does not increase the prediction error, and it has demonstrated superior prediction ability in many practical areas. By not only increasing but also stabilizing prediction accuracy, CUE can improve the performance of response modeling.
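The five steps above can be summarized in code. The sketch below is an illustrative reading of CUE, not the authors' implementation: scikit-learn's default k-means++ seeding stands in for the CSI initialization, K is assumed given rather than selected by the SD validity index, and logistic regression is used as a stand-in base classifier.

```python
# A minimal sketch of the CUE procedure (Steps 1-5), assuming a numeric
# feature matrix X and binary labels y (1 = respondent, 0 = non-respondent).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cue_fit(X, y, K=3, n_sets=10, seed=0):
    rng = np.random.default_rng(seed)
    resp, non = X[y == 1], X[y == 0]                      # Step 1: split by class
    labels = KMeans(n_clusters=K, n_init=10,
                    random_state=seed).fit_predict(non)   # Step 2: cluster non-respondents
    models = []
    for _ in range(n_sets):                               # Step 3: balanced training sets
        parts = []
        for k in range(K):                                # sample proportionally to cluster size
            members = non[labels == k]
            n_k = min(max(round(len(resp) * len(members) / len(non)), 1), len(members))
            parts.append(members[rng.choice(len(members), n_k, replace=False)])
        Xb = np.vstack([resp] + parts)
        yb = np.r_[np.ones(len(resp)), np.zeros(len(Xb) - len(resp))]
        models.append(LogisticRegression(max_iter=1000).fit(Xb, yb))  # Step 4
    return models

def cue_predict_proba(models, X):
    # Step 5: aggregate individual predictions by averaging response scores.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```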
3. Experimental settings

3.1. Data

In order to verify the proposed data balancing method, we built response models using two publicly available customer data sets. The first data set is the CoIL Challenge 2000 data set.4

4 http://www.liacs.nl/putten/library/cc2000/report2.html

The data is
provided by the Dutch data mining company Sentient Machine Research for a data mining competition. The main tasks released with the data are (1) to predict which customers are potentially interested in a caravan insurance policy, and (2) to describe the actual or potential customers, and possibly explain why these customers buy a caravan policy. For the purpose of response modeling, we focused only on the first task. The training set contains 5822 customers, of whom only 348 (5.98%) are insurance purchasers. There are 85 explanatory variables, including 42 product usage variables and 43 socio-demographic variables derived from zip area codes. The test set contains 4000 customers, of whom only 238 are insurance purchasers (respondents). Because our goal in this experiment was to compare performance across data balancing methods, we used all variables when building the response models.
The DMEF4 customer data set is provided by the Direct Marketing Educational Foundation for research purposes.5 The main task released with the DMEF4 data set is to discover the customers who purchased the product in the test period (1992/09–1992/12), based on demographic and historical purchase information available in the reference or training period (1971/12–1992/06) (Malthouse, 2002; Ha et al., 2005; Shin & Cho, 2006). The original DMEF4 data set consists of 101,532 customers and 91 input variables. In this paper, we used the 15 input variables used in Lee and Cho (2007), obtained by deleting two highly correlated variables from the 17 variables selected by Malthouse (2001). Among the 15 input variables, five are original variables while the other ten are derived variables. The description and formulation of all variables are shown in Table 1. The original DMEF4 data set has two target variables: TARGDOL and TARGORD. TARGDOL is the purchase dollar amount during the test period and TARGORD is the number of orders made during the test period. Since we focused on whether or not a customer responds to the campaign, we created a binary target variable, TARGRES: a customer's TARGRES is 1 if both TARGDOL and TARGORD are greater than 0; otherwise it is 0. Since separate training and test sets do not exist, we randomly split the data into 10% for training and 90% for test in order to avoid over-fitting and to emphasize test results. In addition, we employed 10 different realizations (training-test combinations) to achieve statistically valid results.
5 http://www.the-dma.org/dmef/

Table 1. The 15 input variables used for building response models.

Name     Description                                             Formulation
Original variables
Purseas  Number of seasons with a purchase
Falord   LTD fall orders
Ordtyr   Number of orders this year
Puryear  Number of years with a purchase
Sprord   LTD spring orders
Derived variables
Recency  Order days since 10/1992
Tran53   Recency indicator                                       $I(180 \leq \mathrm{Recency} \leq 270)$
Tran54   Recency indicator                                       $I(270 \leq \mathrm{Recency} \leq 366)$
Tran55   Recency indicator                                       $I(366 \leq \mathrm{Recency} \leq 730)$
Tran38   Inverse of recency                                      $1/\mathrm{Recency}$
Comb2    Number of product groups                                $\sum_{m=1}^{14} \mathrm{ProdGrp}_m$
Tran46   Interaction between the numbers of orders               $\log(1 + \mathrm{Ordtyr} \times \mathrm{Sprord})$
Tran42   Interaction between LTD orders and LTD spring orders    $\sqrt{\mathrm{Ordhist} \times \mathrm{Sprord}}$
Tran44   Square root of the number of product groups             $\sqrt{\mathrm{Comb2}}$
Tran25   Inverse of latest-season items                          $1/(1 + \mathrm{Lorditm})$

3.2. Prediction models

We employed four classification algorithms as prediction models: logistic regression, multi-layer perceptron (MLP) neural network, k-nearest neighbor classification, and support vector machine (SVM).
Logistic regression has a strong theoretical foundation and allows one to interpret coefficients, so that one can examine how much a change in an input variable affects the classification result. Logistic regression estimates coefficients β that link the log odds of the target to the input vector as follows:

$$\log_e \frac{P(y=1)}{1 - P(y=1)} = \boldsymbol{\beta}^T \mathbf{x} + \beta_0, \qquad (6)$$

where P(y = 1) is the probability of response. The logit transformation is preferred because the relationship between the variables may be nonlinear, and the probability of response should take a value between 0 and 1. Thus, logistic regression estimates the likelihood of purchase, P(y = 1), as follows:

$$P(y=1) = \frac{\exp(\boldsymbol{\beta}^T \mathbf{x} + \beta_0)}{1 + \exp(\boldsymbol{\beta}^T \mathbf{x} + \beta_0)}. \qquad (7)$$

The coefficients of logistic regression can be found by either maximum likelihood estimation (MLE) or the expectation–maximization (EM) algorithm.
MLP is a well-known non-parametric algorithm based on empirical risk minimization. It has been widely adopted in marketing applications (Ahn, Choi, & Han, 2007; Gruca, Klemz, & Petersen, 1999; Kim, Street, Russell, & Menczer, 2005). We employed 3-layer feed-forward neural networks with d input nodes, h hidden nodes, and one output node. The net input of the jth hidden node, $a_j$, is the weighted sum of the d input values and the bias:

$$a_j = \sum_{i=0}^{d} w_{ji}^{(1)} x_i, \qquad (8)$$

where $w_{ji}^{(1)}$ is the weight between the ith input node and the jth hidden node. The activated value of hidden node j, $z_j$, is obtained as follows:

$$z_j = g(a_j), \qquad (9)$$

where g(a) is an activation function, which is the logistic sigmoid in our experiment:

$$g(a) = \frac{1}{1 + \exp(-a)}. \qquad (10)$$

Besides the logistic sigmoid, a step function or hyper-tangent function can also be used. The output value of the output node is obtained as the weighted sum of the activated values of all hidden nodes:

$$\hat{y} = \sum_{j=0}^{h} w_j^{(2)} z_j,$$

where $w_j^{(2)}$ is the weight between the jth hidden node and the output node. The output of the network, $\hat{y}_i$, can thus be expressed in terms of the input values and weights as follows:

$$\hat{y}_i = \hat{y}(\mathbf{x}_i) = \sum_{j=0}^{h} w_j^{(2)} \, g\left(\sum_{i=0}^{d} w_{ji}^{(1)} x_i\right). \qquad (11)$$

MLP finds the optimal weight values by minimizing the squared error between the targets and the network outputs:

$$E = \frac{1}{2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \qquad (12)$$
To solve this optimization problem, various algorithms such as backpropagation, Newton algorithm, and Levenberg–Marquardt algorithm can be used.
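For orientation, the four base classifiers of this section can be instantiated as below. This is a rough stand-in, not the paper's setup: plain k-NN replaces the locally linear reconstruction weighting of Kang and Cho (2008) discussed later, and the hyperparameters are illustrative rather than the cross-validated values.

```python
# Hedged scikit-learn stand-ins for the four prediction models of Section 3.2.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(hidden_layer_sizes=(10,), activation="logistic", max_iter=2000),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale", probability=True),  # Gaussian kernel
}

def fit_all(X_train, y_train):
    return {name: clf.fit(X_train, y_train) for name, clf in classifiers.items()}
```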
The k-nearest neighbor classification (k-NN) predicts whether a customer will respond based on how similar customers, called neighbors, behave. k-NN first computes a similarity measure, Euclidean distance in our experiment, between the test customer and the reference customers. Then, a predefined number of neighbors is selected. The response score, $\hat{y}_i$, is estimated by combining the response labels of the selected neighbors as follows:
$$\hat{y}_i = \sum_{j \in kNN(\mathbf{x}_i)} w_j y_j, \qquad (13)$$
where $kNN(\mathbf{x}_i)$ denotes the index set of the k nearest neighbors of $\mathbf{x}_i$, and $w_j$ denotes the weight assigned to the neighbor $\mathbf{x}_j$. If $\hat{y}_i$ is close to 1, the customer is predicted to be a respondent; otherwise, a non-respondent. There are two issues for k-NN: (1) how many neighbors should be selected, and (2) how the weights should be assigned. To address them, we adopted the locally linear reconstruction algorithm proposed by Kang and Cho (2008).
SVM is based on the structural risk minimization (SRM) principle and is well known for its high generalization ability. SVM has been successfully applied in a wide range of areas such as text classification (Zhuang et al., 2005), image retrieval (Hoi, Chan, Huang, Lyu, & King, 2004), medical diagnosis (Duan, Rajapakse, Wang, & Azuaje, 2005), and marketing (Cui & Curry, 2005). SVM finds the hyperplane $f(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b$ that maximizes the margin between the two classes in a feature space induced by a non-linear mapping $\varphi$. To do so, SVM solves the following minimization problem:
$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i, \quad \text{s.t.} \;\; y_i(\mathbf{w}^T \varphi(\mathbf{x}_i) + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \;\; i = 1, \ldots, N. \qquad (14)$$

The first term, $\frac{1}{2}\|\mathbf{w}\|^2$, is concerned with maximizing the margin, while the second term, $\sum_i \xi_i$, is concerned with minimizing the misclassification cost for the non-separable case. C controls the trade-off between the margin width and the misclassification cost, while $\xi_i$ is a slack variable. The primal Lagrangian of Eq. (14) becomes

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \{y_i(\mathbf{w}^T \varphi(\mathbf{x}_i) + b) - 1 + \xi_i\} - \sum_i \mu_i \xi_i. \qquad (15)$$

By the Karush–Kuhn–Tucker conditions, the optimal solution must satisfy the following equations:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\rightarrow\; \mathbf{w} - \sum_i \alpha_i y_i \varphi(\mathbf{x}_i) = 0, \qquad (16)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\rightarrow\; \sum_i \alpha_i y_i = 0, \qquad (17)$$

$$\frac{\partial L_P}{\partial \xi_i} = 0 \;\rightarrow\; 0 \leq \alpha_i \leq C, \;\; \forall i. \qquad (18)$$

By substituting Eqs. (16)–(18) into $L_P$, we obtain Wolfe's dual problem:

$$\max_{\boldsymbol{\alpha}} \; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j) + \sum_i \alpha_i, \quad \text{s.t.} \;\; \sum_i \alpha_i y_i = 0, \;\; 0 \leq \alpha_i \leq C, \;\; i = 1, \ldots, N. \qquad (19)$$

Since we only need the inner products of the input vectors in the feature space, we employ a kernel trick, $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$, to calculate the inner products, so that we do not need to define the explicit mapping function from the input space to the feature space. Possible kernels include linear, polynomial, Gaussian, and hyper-tangent. By solving Eq. (19), we obtain the following classifier:

$$y(\mathbf{x}) = \mathrm{sign}\left[\sum_i \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right]. \qquad (20)$$

3.3. Performance measures

For class-imbalanced data, simple accuracy is not an appropriate metric because it depends highly on the relative size of the majority class. Consider a simple direct marketing example. Assume that there are 1000 customers and only 10 of them respond to the campaign. If a classifier decides that no customer will respond to the campaign, it achieves 99% simple accuracy. However, the classifier is of no use because none of the respondents is identified. Therefore, the classification accuracy of both the majority and the minority class should be considered in response modeling. Given a set of test customers and a classifier with a particular threshold, a confusion matrix can be built by counting the number of customers according to their actual and predicted classes, as shown in Table 2.

Table 2. Confusion matrix.

Actual \ Predicted   Respondents               Non-respondents
Respondents          True respondents (TR)     False non-respondents (FN)
Non-respondents      False respondents (FR)    True non-respondents (TN)

The simple accuracy is calculated as follows:

$$\text{Simple Accuracy} = \frac{TR + TN}{TR + TN + FR + FN}. \qquad (21)$$

Since TN greatly outnumbers TR in typical response modeling, simple accuracy does not reflect the accuracies on respondents and non-respondents fairly. Therefore, we employ the balanced correction rate (BCR) as an alternative performance measure. BCR takes the geometric mean of the accuracies on respondents and non-respondents, as in Eq. (22), so that a low value of either results in a low BCR:

$$BCR = \sqrt{\frac{TR}{TR + FN} \times \frac{TN}{TN + FR}}. \qquad (22)$$
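A direct transcription of Eqs. (21) and (22) follows; the label encoding (1 = respondent) and function name are illustrative.

```python
# Simple accuracy, Eq. (21), and BCR, Eq. (22), from binary label arrays.
import numpy as np

def simple_accuracy_and_bcr(y_true, y_pred):
    tr = np.sum((y_true == 1) & (y_pred == 1))  # true respondents
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true non-respondents
    fr = np.sum((y_true == 0) & (y_pred == 1))  # false respondents
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false non-respondents
    acc = (tr + tn) / (tr + tn + fr + fn)
    bcr = np.sqrt((tr / (tr + fn)) * (tn / (tn + fr)))  # geometric mean
    return acc, bcr
```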
3.4. Benchmark balancing methods

In order to compare the performance of CUE with other data balancing methods, we employed four balancing methods: two based on under-sampling and two based on over-sampling. A brief description of each method is shown in Table 3. As under-sampling-based methods, random under-sampling (RUS) and one-sided selection (OSS) were employed. In random under-sampling, a set of non-respondents is selected at random and combined with the respondents; the number of selected non-respondents equals the number of respondents. OSS removes non-respondents considered noise, redundant, or borderline, as shown in Fig. 2. As over-sampling-based methods, random over-sampling (ROS) and the synthetic minority over-sampling technique (SMOTE) were employed. In random over-sampling, respondents are duplicated until their number reaches a predefined level. SMOTE synthetically generates respondents based on a nearest neighbor rule, as shown in Fig. 3. In total,
Table 3. The description of each training data reconstruction method.

Sampling principle   Balancing method                              Abbreviation
No-sampling          No-sampling                                   NS
Under-sampling       Random under-sampling                         RUS
Under-sampling       One-sided selection                           OSS
Over-sampling        Random over-sampling                          ROS
Over-sampling        Synthetic minority over-sampling technique    SMOTE
Under-sampling       Clustering & ensemble                         CUE
including no sampling (which does not treat class imbalance) and the proposed CUE method, six data balancing methods were used.
There were some algorithm- or user-specific parameters in our experiment. As algorithm-oriented parameters, we needed to determine the number of hidden nodes for MLP and the kernel function and misclassification cost for SVM. For MLP, we selected the best number of hidden nodes among 5, 10, 15, 20, 25, and 30 for each balancing method by 5-fold cross validation. For SVM, we used a Gaussian function as the kernel; the width of the Gaussian kernel (σ) and the misclassification cost (C) were selected by 5-fold cross validation. As user-specific parameters for SMOTE, the number of neighbors (k) was set to 5, and the duplication percentage (D) was set to 1000 for the DMEF4 data set and 1700 for the CoIL Challenge 2000 data set, in order to make the number of sampled respondents equal to that of non-respondents. In the segmentation step of CUE (Step 2), the number of clusters (K) was determined based on the SD validity index (Halkidi, Vazirgiannis, & Batistakis, 2000).

4. Experimental results

For both the CoIL Challenge 2000 and DMEF4 data sets, a total of 24 individual response models, each built with one of the aforementioned data balancing methods and one of the classification algorithms, were evaluated and analyzed in terms of BCR, the true response rate/true non-response rate graph (TRR–TNR graph), and the lift chart. Because the actual dollar amounts spent by the respondents are known for the DMEF4 data set, an additional profit analysis was also conducted for it.

4.1. CoIL Challenge 2000 results

The mean and standard deviation of accuracy and BCR of the 24 response models on the CoIL Challenge 2000 data set are shown in Table 4. Regardless of the classification algorithm, all data balancing methods improved BCR. In other words, any method that attempted to alleviate class imbalance gave rise to performance improvement.
Among the data balancing methods, CUE was found to be the best for all four classifiers. Not only did CUE outperform the other methods on average, its variations were also much smaller than those of the others. Also, note that the improvement of OSS over NS was not as significant as that of RUS, ROS, SMOTE, and CUE. This is because OSS was designed to remove only the instances that are not qualified to be used for model building. When the majority class extremely outnumbers the minority class, as in response modeling (non-respondents ≫ respondents), the reduced data set is still imbalanced. In our experiments, the proportion of respondents increased only from 6% to 8% after OSS was applied. Among the classification algorithms, SVM was the best model under imbalanced circumstances (see from the table that SVM was best for NS and OSS). When the class imbalance was eliminated, as with RUS, ROS, SMOTE, and CUE, k-NN reported the best BCRs. Note that MLP resulted in the worst BCRs with all balancing methods. Especially with NS or OSS, not only was its BCR very low, but its variation was also huge: although the variations of the other classifiers were less than 2% with NS and OSS, MLP's variations were 137.11% and 91.18% for NS and OSS, respectively. We suspect that the inherent nature of MLP (getting stuck in local optima), a relatively large number of attributes with insufficient instances (85 attributes with 5822 customers), and the unalleviated class imbalance may have caused the unexpectedly large variations.
Fig. 5 shows the BCR and BCR-variation improvements of CUE over the other balancing methods. Note that the y-axis of Fig. 5(a) is truncated at 15%. We compared CUE with RUS, ROS, and SMOTE only, because the results of NS and OSS were not as good as the others. CUE reported at least 3% higher BCR than RUS and ROS for all response models, and at least 7% higher BCR than SMOTE (Fig. 5(a)). The BCR improvement was especially noticeable with MLP: BCR increased by 8%, 13%, and 35% over RUS, ROS, and SMOTE, respectively. This explains why the BCR of MLP with CUE was comparable to the other classifiers despite the inferior performance of MLP with the other balancing methods. With respect to BCR variation, the effect of CUE is very clear. The variation of BCR was lowered by at least 50% with CUE (Fig. 5(b)); in most cases, CUE reduced BCR variation by more than 70%. This is very important because even a small change in response rate can dramatically influence the success of a marketing campaign, as noted in Section 1. When validating response models before undertaking a direct marketing program, the more stable the classifiers, the easier and more accurate the analysis of the expected consequences of the campaign will be.
Fig. 6 shows the true response rate (TRR, x-axis) against the true non-response rate (TNR, y-axis) for each classifier. A perfect model would correctly predict both respondents and non-respondents
Table 4. The mean and standard deviation of accuracy (ACC) and balanced correction rate (BCR) of each response model with the CoIL Challenge 2000 data set.

Classifier            Measure  NS (%)           OSS (%)          RUS (%)          ROS (%)          SMOTE (%)        CUE (%)
Logistic regression   ACC      0.939 ± 0.00     0.937 ± 0.07     0.641 ± 1.45     0.693 ± 0.72     0.782 ± 0.28     0.699 ± 0.14
                      BCR      0.129 ± 0.00     0.206 ± 1.99     0.648 ± 1.48     0.654 ± 0.69     0.631 ± 0.78     0.673a ± 0.21
MLP                   ACC      0.938 ± 0.34     0.935 ± 0.81     0.598 ± 12.14    0.748 ± 6.28     0.820 ± 3.47     0.673 ± 2.29
                      BCR      0.071 ± 137.11   0.152 ± 91.18    0.614 ± 3.34     0.585 ± 8.29     0.491 ± 4.78     0.662a ± 0.78
k-NN                  ACC      0.918 ± 0.00     0.909 ± 0.08     0.696 ± 1.46     0.715 ± 0.70     0.801 ± 0.51     0.717 ± 0.21
                      BCR      0.313 ± 0.00     0.375 ± 0.05     0.660b ± 1.38    0.666b ± 1.07    0.638b ± 0.63    0.681a,b ± 0.29
SVM                   ACC      0.917 ± 0.00     0.902 ± 0.01     0.691 ± 1.58     0.710 ± 0.68     0.797 ± 0.43     0.706 ± 0.24
                      BCR      0.337b ± 0.00    0.395b ± 0.41    0.656 ± 1.43     0.657 ± 1.01     0.629 ± 0.68     0.671a ± 0.21

The numbers after ± are standard deviations.
a The best BCR among the balancing methods with an identical classifier.
b The best BCR among the classifiers with an identical balancing method.
Fig. 5. BCR and its variation improvements of CUE over other balancing methods with CoIL Challenge 2000 data set.
Fig. 6. True response rate (TRR) versus true non-response rate (TNR) for each classifier with CoIL Challenge 2000 data set.
without any error; thus, it would be placed at the top-right, (1, 1), coordinate in the figure. A model placed at the top-left, (0, 1), coordinate predicts all customers not to respond, which is obviously worthless. A model placed at the bottom-right, (1, 0), coordinate predicts all customers to respond, which conceptually corresponds to mass marketing. In reality, the result of a response model lies between these two extreme cases, because one rate can be increased only at the expense of the other. Therefore, a good response model should be placed as close to the top-right as possible.
A few observations can be made from Fig. 6. First, naive use of imbalanced data is almost useless, because NS is placed in the top-left area. Even though k-NN and SVM caught a few respondents, NS showed a tendency to predict almost all customers as non-respondents. This also explains why NS reported the highest simple accuracy in Table 4 despite its lowest BCR: by predicting most customers as non-respondents, it safely achieved almost 100% accuracy for non-respondents, which overwhelmingly influenced the simple accuracy. Second, OSS did not improve the true response rate over NS. As mentioned before, OSS is not a perfect balancing method,
Fig. 7. Lift charts of all balancing methods for each classifier with CoIL Challenge 2000 data set.
thus it is not quite as effective when the class imbalance is severe. Third, SMOTE resulted in a smaller TRR and larger TNR than RUS, ROS, and CUE, although it also clearly removed the class imbalance, as the others did. Fourth, CUE seemed the best among the balancing methods because it was placed closest to the ideal position (top-right) and balanced TRR and TNR fairly well. RUS and ROS reported results comparable to CUE, but RUS tended to classify non-respondents as respondents more aggressively than CUE, while ROS tended to predict respondents less correctly.
Fig. 7 shows the lift charts of all balancing methods for each classifier. Note that the dotted line (20% mailing depth) is the threshold specified by the CoIL Challenge 2000. In contrast to the TRR–TNR graphs, NS and OSS did reasonably well; even for logistic regression, their lift values at 10% mailing depth were better than the others. This seemingly contradictory phenomenon is caused by the decision mechanism of the classifiers under class-imbalanced circumstances. Although the classes are severely imbalanced, actual respondents may have greater response likelihoods than non-respondents. However, it is very risky to lower the threshold due to the excessive number of non-respondents. As a result, with NS or OSS, classifiers conservatively predicted whether customers would respond, resulting in high true non-response rates and low true response rates. The other balancing methods, on the other hand, tended to predict response aggressively to increase true response rates, because the cost of misclassifying non-respondents is much lower than in the imbalanced cases. Looking back at the lift charts, the balancing methods seemed to help complex classifiers such as MLP, k-NN, and SVM more than simple logistic regression. For logistic regression, the balancing methods were inferior to no-sampling when the mailing depth was 10%; as the mailing depth increased, they improved performance slightly. For
MLP, k-NN, and SVM, on the other hand, the balancing methods worked very well, except for SMOTE. The lift values of RUS, ROS, and CUE lie above NS for all mailing depths. Among the balancing methods, CUE did best: not only did its lift curve lie considerably above that of NS, but it also dominated all the other methods in most cases. This result is consistent with the BCR and TRR–TNR results.

4.2. DMEF4 results

The mean and standard deviation of accuracy and BCR of the 24 response models on the DMEF4 data set are shown in Table 5. The table can be read similarly to that of the CoIL Challenge 2000 data set (Table 4), with a few exceptions. First, all classifiers worked better with the DMEF4 data set than with the CoIL Challenge 2000 data set. One possible cause is the difference in data size: approximately 100,000 (DMEF4) versus 10,000 (CoIL Challenge 2000) customers. Although the class imbalance is severe in the DMEF4 data set, the absolute number of respondents is much larger than in the CoIL Challenge 2000 data set, so the classifiers had more chances to learn the respondent class. Second, for the same reason, the BCRs of OSS are quite competitive. Third, MLP was found to be better than logistic regression and SVM; since MLP is based on the empirical risk minimization principle, the greater number of training instances made it stronger. These aside, the other results are similar to those of the CoIL Challenge 2000 data set: all data balancing methods were found to be effective, CUE outperformed the other methods in terms of both average BCR and its variation, and k-NN reported outstanding performance.
Fig. 8 shows the BCR and BCR-variation improvements of CUE over the other balancing methods. There was only one case in which CUE did not increase the BCR: when MLP was used for prediction, CUE
Table 5. The mean and standard deviation of accuracy (ACC) and balanced correction rate (BCR) of each response model with the DMEF4 data set.

Classifier            Measure  NS (%)           OSS (%)          RUS (%)          ROS (%)           SMOTE (%)        CUE (%)
Logistic regression   ACC      0.927 ± 0.00     0.925 ± 0.00     0.809 ± 0.85     0.815 ± 0.32      0.797 ± 0.20     0.824 ± 0.22
                      BCR      0.564 ± 0.00     0.608 ± 0.05     0.759 ± 0.66     0.757 ± 0.12      0.768 ± 0.59     0.776a ± 0.15
MLP                   ACC      0.936 ± 0.95     0.933 ± 0.94     0.809 ± 1.98     0.865 ± 1.90      0.856 ± 2.14     0.861 ± 0.09
                      BCR      0.638 ± 10.11    0.665 ± 8.04     0.775 ± 3.20     0.821a,b ± 2.89   0.811b ± 1.76    0.815 ± 0.76
k-NN                  ACC      0.925 ± 0.00     0.957 ± 0.00     0.774 ± 0.85     0.923 ± 0.07      0.914 ± 0.09     0.845 ± 0.04
                      BCR      0.758b ± 0.00    0.791b ± 0.00    0.798b ± 0.58    0.771 ± 0.66      0.784 ± 1.13     0.837a,b ± 0.20
SVM                   ACC      0.928 ± 0.00     0.924 ± 0.00     0.813 ± 1.40     0.810 ± 0.22      0.812 ± 0.21     0.815 ± 0.25
                      BCR      0.590 ± 0.00     0.632 ± 0.00     0.770 ± 0.55     0.772 ± 0.12      0.770 ± 0.13     0.781a ± 0.10

The numbers after ± are standard deviations.
a The best BCR among the balancing methods with an identical classifier.
b The best BCR among the classifiers with an identical balancing method.
Fig. 8. BCR and its variation improvements of CUE over other balancing methods with DMEF4 data set.
resulted in a slightly lower BCR than ROS. Except for that case, CUE improved BCR over the other balancing methods for all classifiers. When logistic regression or SVM was employed as the base classifier, CUE improved BCR by more than 1% over the other methods. The BCR improvement was remarkable when k-NN was employed as the base classifier: a minimum increase of 5% resulted. As with the CoIL Challenge 2000 data set, CUE also contributed to reducing the BCR variations. For MLP and k-NN, the BCR variations were lowered by more than 50% compared with the other methods. Only one exception was observed: the BCR variation increased with logistic regression compared to ROS.

Fig. 9. True response rate (TRR) versus true non-response rate (TNR) for each classifier with DMEF4 data set.

Fig. 9 shows the true response rate (TRR, x-axis) against the true non-response rate (TNR, y-axis) for each classifier. Note that the lower bound of each rate is limited to 0.3, not 0. As with the CoIL Challenge 2000 data set, using imbalanced data naively is not a good strategy, because NS was not as effective as the others at capturing respondents. With the DMEF4 data set, OSS worked better than with the CoIL Challenge 2000 data set: not only did OSS capture more respondents than NS, it also captured more respondents than ROS and SMOTE with the k-NN classifier. CUE again seemed the best among the balancing methods, because it was placed closest to the ideal position (top-right) and harmonized TRR and TNR fairly well, as in the CoIL Challenge 2000 data set. In contrast to the CoIL Challenge 2000 data set, SMOTE reported a favorable TRR–TNR combination except with k-NN. Among the classifiers, k-NN showed a somewhat different TRR–TNR graph: unlike the other classifiers, in which NS and OSS
are placed far from ROS, RUS, SMOTE, and CUE, NS and OSS resulted in larger TRRs, so their relative distance to the other balancing methods decreased.
Fig. 10 shows the lift charts of all balancing methods for each classifier. Note that the mailing depth is limited to the top 30% prospects. When logistic regression was used, the differences among the lift values were not significant. For the other three classifiers, on the other hand, the differences were noticeable. With MLP or k-NN as the base classifier, RUS was inferior to the other balancing methods for the top 10% prospects: because a great number of non-respondents were removed from the training data, the classifiers tended to make aggressive predictions, which resulted in assigning high purchase likelihoods to non-respondents. With SVM as the base classifier, the lift values of NS and OSS were remarkable for the top 10% prospects; for the top 20% prospects, however, they did not capture as many respondents as the other methods did. Our proposed method, CUE, was competitive for the top 10% prospects: it was the best for the k-NN classifier and second best for MLP. As the mailing depth increased, the effect of CUE became remarkable: for all classifiers, it resulted in the highest lift values for the top 20% and 30% prospects.
A profit analysis was also conducted for the DMEF4 data set, which contains how much the customers spent in addition to whether they responded. Table 6 and Fig. 11 show, respectively, the total profit and the total revenue/cost of each balancing method under different classifier/unit mailing cost (mailing cost per person) combinations. The total profit is calculated as follows:
Fig. 9. True response rate (TRR) versus true non-response rate (TNR) for each classifier with DMEF4 data set. Each panel plots the positions of NS, OSS, RUS, ROS, SMOTE, and CUE; both axes run from 0.3 to 1.
The total profit is calculated as

\[
\text{Total profit} = \text{Total purchase amount (revenue)} - \text{Total mailing cost}
= \sum_{i} \mathrm{TARGDOL}(x_i)\, I(x_i) - c \sum_{i} I(x_i)
= \sum_{i} \bigl( \mathrm{TARGDOL}(x_i) - c \bigr)\, I(x_i), \tag{23}
\]
where TARGDOL(x_i) is the actual purchase amount of customer x_i when he or she receives the mail, I(x_i) = 1 if customer x_i is predicted as a respondent by the model and I(x_i) = 0 otherwise, and c is the mailing cost per person; we used $1, $3, $5, $7, and $10 in the experiment. A customer who purchases $100 is more valuable to the company than 10 customers who contribute only $5 each, since the former increases revenue by $100 while the latter increase revenue by only $50. Measures based on accuracy give equal weight to these customers, and indeed prefer identifying the latter 10 customers correctly without false positives to identifying the former customer with some false positives. Profit analysis, on the other hand, can assist firms in making decisions based on economic benefit.
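Eq. (23) is straightforward to evaluate; the following sketch computes the total profit for the unit costs used in the experiment (the purchase amounts and targeting decisions below are hypothetical placeholders):

```python
import numpy as np

def total_profit(targdol, targeted, unit_cost):
    """Eq. (23): sum over customers of (TARGDOL(x_i) - c) * I(x_i).
    targdol   -- actual purchase amount of each customer (0 if none)
    targeted  -- I(x_i): 1 if predicted as a respondent, else 0
    unit_cost -- c, the mailing cost per person"""
    targdol, targeted = np.asarray(targdol, float), np.asarray(targeted)
    return float(np.sum((targdol - unit_cost) * targeted))

# Hypothetical illustration: four customers mailed, two of them purchase
targdol  = [0.0, 120.0, 0.0, 45.0, 0.0]
targeted = [1,   1,     0,   1,    1  ]
for c in (1, 3, 5, 7, 10):
    print(f"c=${c}: profit=${total_profit(targdol, targeted, c):,.0f}")
```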
Referring back to Table 6 and Fig. 11, note that the total revenue does not change when the unit mailing cost changes: once the prediction is made, the total revenue is fixed, and the total profit becomes a function of the total mailing cost, as in Eq. (23). When unit mailing costs are low ($1 or $3), the balancing methods significantly increased profit compared to NS. Among them, CUE resulted in the highest profit with LR, k-NN, and SVM, while ROS was highest with MLP only. As the unit mailing cost increased, however, the differences became small. The balancing methods generally not only captured more respondents but also predicted more non-respondents to respond; thus they increased both the total revenue and the total cost, as illustrated in Fig. 11. When the unit cost was low, the increased revenues were enough to compensate for the increased cost, so the total profits of RUS, ROS, SMOTE, and CUE were much higher than those of NS or OSS. When the unit cost was high, on the other hand, a large portion of the increased revenue was canceled out by the increased cost for some balancing methods.

The effects of the balancing methods on total profit with regard to the classifiers are as follows. First, OSS increased total revenue without much increase in total cost compared to NS; however, its absolute revenue increases were not as large as those of the other balancing methods, so its contribution to profit was trifling in most cases. With logistic regression, SMOTE resulted in both the highest revenue and the highest cost. Although CUE's total revenue was slightly less than SMOTE's, it saved more cost, so CUE reported the highest profit for most unit costs. With MLP, ROS resulted in the highest total profit regardless of the unit cost; CUE followed ROS, but the difference was not very significant. With k-NN, CUE resulted in the highest total profit regardless of the mailing cost. Even though RUS increased revenue more than CUE, it required too much cost: as shown in Fig. 11(f), it required by far the highest campaign expenditure among the methods, which reduced its profit significantly. With SVM, CUE reported the highest total profit except when the mailing cost was very high; although its total cost was almost the same as that of RUS, ROS, and SMOTE, its profit increases were manifestly larger than theirs.

4.3. Summary

Based on the experiments with two actual customer data sets, the effect of CUE can be summarized as follows. First, CUE evidently improved BCR, which was achieved by increasing the true response rate while keeping the true non-response rate as high as possible.
Fig. 10. Lift charts of all balancing methods (NS, OSS, RUS, ROS, SMOTE, CUE) for each classifier with DMEF4 data set; mailing depths of 10%, 20%, and 30%.
Table 6
Total profit of each balancing method with different classifier-unit mailing cost combinations.

Classifier  Cost  NS         OSS        RUS        ROS        SMOTE      CUE
LR          $1    $143,685   $164,074   $273,526   $271,222   $286,332a  $282,271
            $3    $136,615   $154,597   $231,706   $230,714   $240,419   $242,463a
            $5    $129,545   $145,120   $189,886   $190,206   $194,506   $202,655a
            $7    $122,475   $135,643   $148,067   $149,699   $148,593   $162,847a
            $10   $111,870   $121,427a  $85,337    $88,937    $79,723    $103,135
MLP         $1    $176,206   $191,920   $285,009   $304,891a  $300,024   $301,512
            $3    $167,403   $181,323   $241,983   $270,791a  $264,806   $267,050
            $5    $158,600   $170,727   $198,956   $236,692a  $229,588   $232,588
            $7    $149,797   $160,131   $155,930   $202,592a  $194,370   $198,126
            $10   $136,592   $144,236   $91,391    $151,442a  $141,543   $146,433
k-NN        $1    $237,191   $266,006   $315,869   $247,346   $259,453   $318,904a
            $3    $220,151   $242,720   $263,070   $228,964   $238,468   $280,406a
            $5    $203,111   $219,435   $210,271   $210,582   $217,483   $241,908a
            $7    $186,071   $196,150   $157,473   $192,200   $196,499   $203,410a
            $10   $160,511   $161,221   $78,275    $164,627   $165,021   $169,584a
SVM         $1    $147,926   $173,446   $280,392   $279,325   $276,280   $289,445a
            $3    $139,896   $162,596   $238,479   $236,605   $234,308   $247,379a
            $5    $131,866   $151,746   $196,566   $193,886   $192,335   $205,313a
            $7    $123,836   $140,896   $154,652   $151,167   $150,362   $163,247a
            $10   $111,791   $124,621a  $91,783    $87,088    $87,404    $100,148

a The highest profit among the methods under the same classifier and unit cost.
Second, CUE made the response model more stable than the other balancing methods, decreasing the standard deviation of the BCR by at least 50% in most cases. Third, the impact of CUE was classifier independent: with four different classifiers, CUE delivered outstanding performance in terms of BCR, BCR variation, and lift values. Fourth, CUE contributed to the profit increase more than any other benchmark method, achieving the highest profit by significantly increasing revenue with limited cost increases.
Fig. 11. Total revenue and cost of each method (NS, OSS, RUS, ROS, SMOTE, CUE) with various unit mailing costs ($1, $3, $5, $7, $10), for each classifier.
5. Conclusion

Class imbalance is one of the most significant issues in response modeling, since it prevents response models from achieving high performance. In this study, we proposed a new data balancing method, named CUE, based on clustering, under-sampling, and ensemble to deal with class imbalance and thus improve response modeling. To investigate the effect of the proposed data balancing method, we built 24 response models, combining six data balancing methods with four classification algorithms, using two actual customer data sets, and analyzed the results in terms of response rate and profit.
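To make the pipeline concrete, the sketch below implements a clustering/under-sampling/ensemble scheme in the spirit of CUE, not the exact procedure described earlier in the paper: the base learner (logistic regression), the number of clusters and ensemble members k, the per-cluster sampling quota, and the score averaging are all our illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cue_fit_predict(X_min, X_maj, X_test, k=5, seed=0):
    """Schematic CUE: cluster the non-respondents (majority class),
    draw k balanced under-samples guided by the clusters, train one
    base classifier per subset, and average the predicted scores."""
    rng = np.random.RandomState(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_maj)
    members = [np.where(labels == c)[0] for c in range(k)]
    quota = max(1, len(X_min) // k)  # ~len(X_min) majority samples per subset
    scores = np.zeros(len(X_test))
    for _ in range(k):
        idx = np.concatenate([rng.choice(m, size=quota, replace=len(m) < quota)
                              for m in members if len(m) > 0])
        X = np.vstack([X_min, X_maj[idx]])
        y = np.r_[np.ones(len(X_min)), np.zeros(len(idx))]
        scores += LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X_test)[:, 1]
    return scores / k  # ensemble-averaged purchase likelihood
```

Sampling within clusters keeps each balanced subset representative of the full non-respondent population, which is the intuition behind combining clustering with under-sampling before building the ensemble.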
Based upon the results of the analyses, we found our proposed method to be valuable in that (1) it increased the response rate (predictive accuracy), (2) it stabilized the response model, and (3) it increased profit by significantly increasing revenue. These favorable experimental results suggest new directions for future work. First, we focused only on data balancing methods themselves. As discussed in the lift analysis, a proper algorithm modification may boost response modeling performance if it is adequately combined with data balancing methods; thus, we should investigate ways to integrate our proposed data balancing method with algorithm modification techniques. Second, it would be worthwhile to extend our proposed method to up-lift modeling. While traditional response modeling focuses on identifying buyers, up-lift modeling finds the customers who will buy the product only when they are targeted by the marketing campaign. In other words, up-lift modeling tries to increase profit by reducing cost, skipping the customers who will buy anyway whether or not they are targeted. Since class imbalance remains a critical obstacle to up-lift modeling, it is worth studying the effect of our data balancing method on up-lift modeling.

Acknowledgment

The first author was supported by the research program funded by Seoul National University of Science & Technology (Seoultech) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (MEST) (No. 2011-0021893). The second author was supported by the second stage of the Brain Korea 21 Project in 2011, the National Research Foundation of Korea grants funded by the Korean government (MEST) (Nos. 20110030814 and 20110000164), and the Engineering Research Institute of Seoul National University.