Expert Systems with Applications 34 (2008) 1102–1108
www.elsevier.com/locate/eswa
Response modeling with support vector regression

Dongil Kim, Hyoung-joo Lee, Sungzoon Cho *

Seoul National University, San 56-1, Shillim-dong, Kwanak-gu, Seoul 151-744, Republic of Korea
Abstract

Response modeling has become a key factor in direct marketing. In general, there are two stages in response modeling: the first is to identify respondents in a customer database, while the second is to estimate the purchase amounts of those respondents. This paper focuses on the second stage, where a regression, not a classification, problem is solved. Recently, several non-linear models based on machine learning, such as support vector machines (SVM), have been applied to response modeling. However, there is a major difficulty: a typical training dataset for response modeling is so large that training takes very long or may even be impossible. Therefore, sampling methods have usually been employed in practice, but a sampled dataset usually leads to lower accuracy. In this paper, we employ an ε-tube based sampling for support vector regression (SVR), which leads to better accuracy than random sampling.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Response modeling; Customer relationship management; Direct marketing; Support vector machines; Regression; Pattern selection
1. Introduction

A response model, given a mailing campaign, predicts whether each customer will respond or how much each customer will spend, based on a database of customers' demographic information and/or purchase history. Marketers then send mails or catalogs to the customers who are predicted to respond or to spend large amounts of money. A well-targeted mail increases profit, while a mistargeted or unwanted mail not only increases marketing cost but may also worsen a firm's relationship with its customers (Gönül, Kim, & Shi, 2000; Potharst et al., 2000). Various methods have been used for response modeling, such as statistical techniques (Bentz & Merunka, 2000; Haughton & Oulabi, 1997; Ling & Li, 1998; Suh, Noh, & Suh, 1999), machine learning techniques (Cheung, Kwok, Law, & Tsui, 2003; Chiu, 2002; Shin & Cho, 2006; Viaene et al., 2001; Wang, Zhou, Yang, & Yeung, 2005; Yu & Cho,
* Corresponding author. Tel.: +82 2 880 6275; fax: +82 2 889 8560.
E-mail addresses: [email protected] (D. Kim), [email protected] (H.-j. Lee), [email protected] (S. Cho).
0957-4174/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2006.12.019
2006) and neural networks (NN) (Bentz & Merunka, 2000; Potharst et al., 2000; Zahavi & Levin, 1997). In general, there are two stages in response modeling. The first stage is to identify respondents in a customer database, while the second is to estimate the purchase amounts of those respondents. The first stage has usually been formulated as a binary classification problem: the customers are divided into two classes, respondents and non-respondents, and a classifier is constructed to predict whether a given customer will respond. However, as pointed out for the KDD-CUP-98 task (KDD98 Cup, 1998), there is an inverse correlation between the likelihood to buy and the dollar amount spent (Wang et al., 2005), because the larger the dollar amount involved, the more cautious a customer becomes in making a purchase decision. Hence, one needs a regression model in the second stage to estimate the purchase amounts of the responding customers. The support vector machine (SVM) has recently drawn attention for its strong generalization performance, achieved by employing the structural risk minimization (SRM) principle (Vapnik, 1995). Support vector regression (SVR), the regression version of SVM, was developed to estimate regression functions
(Drucker, Burges, Kaufman, Smola, & Vapnik, 1997). Like SVM, SVR can solve non-linear problems using kernel functions and has been successful in various domains (Drucker et al., 1997; Müller et al., 1997). However, training SVR on a real-world dataset is difficult: as the number of training patterns increases, training takes much longer, with a time complexity of O(n^3), where n denotes the number of training patterns. Many algorithms such as Chunking, SMO, SVMlight and SOR have been proposed to reduce the training time, but their time complexity is still strongly related to the number of training patterns (Platt, 1999). We take another direction, called pattern selection, which focuses on reducing the number of training patterns. Neighborhood property based pattern selection (NPPS), proposed by Shin and Cho (2003), is a powerful pattern selection method for SVM, but it is designed for classification, not regression. Recently, a pattern selection method based on the ε-tube (SVR-ε) was proposed which is specifically designed for SVR (Kim & Cho, 2006). We thus employ SVR as a response model to predict the amount of money spent by each respondent. One can improve the performance of a response model by identifying profitable respondents, rather than just respondents, among all customers. Hence, after applying a classification model that predicts the likelihood to buy, one needs a regression model that predicts the amount to buy. Since classification is not our main concern, we assume a perfect classifier that can identify all respondents without false positive (FP) errors, so the SVR model is constructed on a subset of customers consisting only of respondents. The procedure for constructing the proposed response model is depicted in Fig. 1. To reduce the training time of SVR, we employ the pattern selection method. The DMEF4 dataset from the Direct Marketing Educational Foundation (DMEF) is used in our experiments.
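The two-stage procedure sketched in Fig. 1 can be illustrated as follows. This is a hedged sketch, not the authors' code: scikit-learn models and synthetic data stand in for the actual classifier and the DMEF4 dataset, and all names and parameter values are ours.

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for a customer table: 500 customers, 15 features.
X = rng.normal(size=(500, 15))
responded = (X[:, 0] + rng.normal(scale=0.5, size=500)) > 0   # stage-1 labels
amount = np.where(responded, 50 + 10 * X[:, 1] ** 2, 0.0)     # stage-2 target

# Stage 1: classify respondents vs. non-respondents.
clf = SVC(kernel="rbf").fit(X, responded)

# Stage 2: regress purchase amount, trained on respondents only
# (the paper assumes a perfect stage-1 classifier, so the true labels
# are used for indexing here).
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X[responded], amount[responded])

# Scoring: mail customers predicted to respond, ranked by predicted spending.
pred_respond = clf.predict(X)
scores = reg.predict(X)
ranked = np.argsort(-(scores * pred_respond))
```

In this sketch a perfect classifier would simply replace `clf`; the paper makes exactly that assumption in its experiments.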
A small dataset is used to measure the improvement in efficiency. For a very large dataset, however, some kind of sampling is inevitable anyway. The remainder of this paper is organized as follows. Section 2 briefly introduces SVR and presents the main idea of the pattern selection method. Section 3 describes the DMEF dataset and our experimental settings in detail. Section 4 presents the experimental
results. Section 5 concludes the paper with remarks on limitations and future research directions.

2. Pattern selection for support vector regression

2.1. Support vector regression

For a brief review of SVR, consider a regression function f(x) estimated using a hyperplane w from training patterns {(x_i, y_i) | i = 1, 2, ..., n}:

f(x) = \langle w, x \rangle + b, \quad w, x \in \mathbb{R}^N, \ b \in \mathbb{R},   (1)

where \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^N \times \mathbb{R}.

By the SRM principle, the generalization accuracy is optimized over both the empirical error and the flatness of the regression function, the latter being guaranteed by a small w. The objective of SVR is therefore to include the training patterns inside an ε-insensitive tube (ε-tube) while keeping the norm \|w\|^2 as small as possible. This can be formulated as the following soft margin optimization problem:

Minimize \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*),

Subject to \quad y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \quad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, n,   (2)

where C, ε, and ξ (ξ*) denote the trade-off cost between the empirical error and the flatness, the size of the ε-tube, and the slack variables, respectively. By introducing Lagrange multipliers α and α*, this QP problem can be solved in its dual form. SVR can also estimate a non-linear function by employing a kernel function k(x_i, x_j). The regression function estimated by SVR can then be written as the kernel expansion

f(x) = \sum_{i=1}^{n_s} (\alpha_i - \alpha_i^*) k(x_i, x) + b,   (3)

where n_s is the number of support vectors. The RBF kernel is used in this paper.

2.2. Pattern selection for support vector regression
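To make the formulation of Section 2.1 concrete, the following sketch fits an ε-SVR with an RBF kernel and checks which training patterns lie inside the ε-tube, i.e. satisfy |y_i − f(x_i)| ≤ ε; this membership test underlies the pattern selection method of this section. scikit-learn is our assumed implementation (the paper does not specify one), and the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y = np.sinc(X).ravel() + rng.normal(scale=0.05, size=200)

eps = 0.1
svr = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)

# Patterns strictly inside the eps-tube contribute no loss; patterns on
# or outside its boundary become the support vectors of Eq. (3).
residual = np.abs(y - svr.predict(X))
inside = residual < eps
print(inside.sum(), "of", len(y), "patterns inside the eps-tube")
```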
Fig. 1. The procedure of constructing the proposed response model. After identifying respondents using a classification model, dollar amounts to be spent by them are predicted using a regression model. This paper focuses on the two dark boxes.
The training time complexity of SVR is O(n^3): as the number of training patterns increases, the training time grows cubically. Training SVR directly on a marketing dataset therefore takes too long, since marketing databases usually contain over one million customers and hundreds of input variables. We thus apply a pattern selection method to reduce the training time of SVR, as proposed in our previous research (Kim & Cho, 2006). SVR is trained with the ε-insensitive loss function and forms an ε-tube around the training patterns. Patterns within the ε-tube are not counted as errors, while patterns outside the ε-tube become support vectors
(SVs) that will be used at test time. In addition, SVR estimates the regression function as the center-line of the ε-tube. If the ε-tube could be estimated before training, we could find the regression function using only the patterns inside the ε-tube (see Fig. 2). However, removing all patterns outside the ε-tube can shrink the ε-tube itself, so it is desirable to keep some of the "outside" patterns for training. A "fitness" probability is defined for each pattern based on its
Fig. 2. (a) The regression function after training on the original dataset, and (b) the regression function after training on only the patterns inside the estimated ε-tube.

Fig. 3. An example of ε-tube based pattern selection: (a) an SVR function trained with the original dataset, (b) an SVR trained with a bootstrap sample, (c) the original dataset and the ε-tube of the SVR in (b), and (d) an SVR trained with the selected patterns.
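The procedure illustrated in Fig. 3 can be sketched as follows: k SVRs are trained on bootstrap samples, each pattern's ε-tube membership count m_j is converted to a selection probability as in Eq. (4), and patterns are drawn accordingly. This is our minimal reading of the method, with scikit-learn's SVR assumed as the base learner and all parameter values illustrative; the paper does not specify whether selection is with or without replacement, and we sample without replacement here.

```python
import numpy as np
from sklearn.svm import SVR

def select_patterns(X, y, n_select, k=10, frac=0.25, eps=0.1, C=10.0, seed=0):
    """eps-tube based pattern selection (sketch).

    Trains k SVRs on bootstrap samples of size l = frac * n, counts for
    each pattern how many of the k eps-tubes contain it (m_j), converts
    the counts to probabilities p_j = m_j / sum_i m_i, and samples
    n_select patterns according to p_j.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    l = max(2, int(frac * n))
    m = np.zeros(n)
    for _ in range(k):
        idx = rng.integers(0, n, size=l)                  # bootstrap sample
        svr = SVR(kernel="rbf", C=C, epsilon=eps).fit(X[idx], y[idx])
        m += np.abs(y - svr.predict(X)) <= eps            # inside this eps-tube?
    p = m / m.sum() if m.sum() > 0 else np.full(n, 1.0 / n)
    return rng.choice(n, size=n_select, replace=False, p=p)

# Toy regression problem, in the spirit of Fig. 3.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

sel = select_patterns(X, y, n_select=60)   # keep 20% of the patterns
svr_sel = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X[sel], y[sel])
```

Because p_j is proportional to m_j, patterns that fall outside most of the bootstrap tubes can still be selected occasionally, which matches the motivation for keeping some "outside" patterns.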
location with respect to the ε-tube, and patterns are then selected stochastically based on their fitness probabilities. We generate k bootstrap samples of size l (l < n) from the original training dataset D. An SVR function is trained on each bootstrap sample, yielding k SVR functions. Each SVR function is used to evaluate whether a training pattern is located inside its ε-tube. Since k ε-tubes are obtained from the k SVR functions, each training pattern in D lies inside anywhere from zero to k ε-tubes. Let m_j denote the number of times pattern j is found inside the ε-tubes. If we assume that the "true" ε-tube is the one obtained when all training patterns are used, we can use m_j as the likelihood that pattern j is actually located inside the true ε-tube. Each m_j is converted to a probability as follows:

p_j = m_j / \sum_{i=1}^{n} m_i.   (4)

Pattern j is then selected with probability p_j. The procedure of the pattern selection method is illustrated in Fig. 3 with a simple toy example; note that the SVR functions in Fig. 3a and d are virtually identical. The algorithm is presented in Fig. 4.

Fig. 4. Algorithm of the ε-tube based pattern selection.

3. Dataset and experimental settings

3.1. DMEF dataset

We utilized the DMEF4 dataset, which contains 101,532 customers and 91 input variables. The response rate is 9.4%, with 9571 respondents and 91,961 non-respondents. A subset was selected based on the weighted dollars used in previous research (Ha, Cho, & MacLachlan, 2005; Yu & Cho, 2006). We would like to use a regression model for scoring after identifying respondents with a classification model. Since we are not interested in classification, we assumed a perfect classifier that could identify all respondents without false positive errors. Hence, we generated a new dataset consisting only of 4000 respondents.

For performance evaluation, the dataset was partitioned into training and test sets. Half of the customers were randomly assigned to the training set and the other half to the test set. Ten different training/test splits were generated, since the performance of a model varies considerably with the specific data split (Malthouse, 2002). All results were averaged over the test splits.

We are not interested in feature selection/extraction. Malthouse (2002) extracted 17 input variables for this dataset, and Ha et al. (2005) used 15 of them, removing two variables whose variation is negligible. In this paper, these 15 variables were used as inputs, as listed in Table 1. We formulated the dataset as a regression problem by setting the total dollar amount as the target variable.

Table 1
The 15 input variables used in the experiments

Name      Formulation                     Description
Original variables
  Purseas                                 Number of seasons with a purchase
  Falord                                  LTD fall orders
  Ordtyr                                  Number of orders this year
  Puryear                                 Number of years with a purchase
  Sprord                                  LTD spring orders
Derived variables
  Recency                                 Order days since 10/1992
  Tran53    I(180 ≤ recency ≤ 270)
  Tran54    I(270 ≤ recency ≤ 366)
  Tran55    I(366 ≤ recency ≤ 730)
  Tran38    1/recency
  Comb2     Σ_{m=1}^{14} ProdGrp_m        Number of product groups purchased from this year
  Tran46    √(comb2)
  Tran42    log(1 + ordtyr · falord)      Interaction between the numbers of orders
  Tran44    √(ordhist · sprord)           Interaction between LTD orders and LTD spring orders
  Tran25    1/(1 + lorditm)               Inverse of latest-season items

3.2. Experimental settings

Three response models using SVR were compared: SVR with ε-tube based pattern selection (SVR-ε), SVR with all training patterns (SVR-All), and SVR with random sampling (SVR-Random). It was expected that SVR-ε would be faster than SVR-All and more accurate than SVR-Random. The RBF kernel was used as the kernel function, and the kernel parameter σ was fixed to 1.0 for all datasets. The other hyper-parameters of SVR were determined by cross-validation over C × ε = {0.1, 1, 5, 10, 50, 100} × {0.01, 0.05, 0.07, 0.1, 0.5, 0.7, 1}. The parameters of the pattern selection method were set as follows: the number of bootstrap samples k was set to 10, and the size of each bootstrap sample, l, was set to 25% of the number of patterns in the dataset, n. The number of patterns to be selected was varied from 10% to 90% of n.

We evaluated performance both in terms of model fit and profit. The root mean squared error (RMSE) is used to estimate the model fit. The average profit per mail is
calculated to estimate the profitability of the response models. Each model predicted the dollar amounts spent by the customers in the test set. Then we sorted the customers in descending order of predicted dollar amount. It was assumed that a marketer would send direct mails, down to a certain mailing depth, to the customers whose predicted dollar amounts were highest. The training times of the three models were also compared.

4. Experimental results

4.1. Performance

Fig. 5. RMSEs of the three SVR models with respect to the percentage of patterns selected: the solid, circled, and boxed lines represent SVR-All, SVR-ε, and SVR-Random, respectively.
Fig. 5 shows the RMSEs with respect to the number of selected patterns. As expected, SVR-All yielded the lowest RMSEs. As more patterns were selected, the training dataset became more similar to the original dataset and the gaps between the SVR models narrowed. SVR-ε was slightly less accurate than SVR-All, but significantly more accurate than SVR-Random. With 10% of the patterns selected, the difference in RMSE between SVR-ε and SVR-Random was larger than $15, while that between SVR-ε and SVR-All was about $7. When 40% of the patterns were selected, the difference between SVR-ε and SVR-All decreased to less than $5, while SVR-Random yielded an RMSE about $7 higher than
SVR-All did. Put another way, SVR-ε with 10% of the patterns was comparable to SVR-Random with 40%, and SVR-ε with 40% of the patterns to SVR-Random with 60%. SVR-ε showed little improvement when more than 70% of the patterns were selected, but by then its RMSEs were only marginally above those of SVR-All.

Fig. 6. Average profit per mail based on the three SVR models. The mailing depth indicates how many customers would receive direct mails: (a) 10% of patterns selected, (b) 30% of patterns selected, (c) 50% of patterns selected, (d) 70% of patterns selected.

Fig. 6 compares the three SVR models in terms of profit. Recall that all the customers were respondents identified by a perfect classifier. Therefore, if we sent catalog mails to all respondents, the average profit per mail would be only about $60. If instead we selected the respondents whose predicted dollar amounts were high, the average profit per mail would increase. For example, with SVR-All at a mailing depth of 5%, the average profit increased to about $120, though it decreased as the mailing went deeper. If the mailing cost is low, it is desirable to send mails to as many customers as possible; if the mailing cost is relatively high, one should send mails only to the few profitable customers. Thus, a regression model is suited to situations where the mailing cost is high.

Comparing the three SVR models, SVR-All was again the best. SVR-ε was more profitable than SVR-Random, but less profitable than SVR-All. As the number of selected patterns increased, the differences between the three models shrank. In particular, when 70% of the patterns were selected, the three models yielded virtually identical profits, although their RMSEs still differed.
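The profit curves of Fig. 6 follow from a simple evaluation: rank the test customers by predicted dollar amount and average the actual amounts over the top fraction (the mailing depth). Below is a sketch of that evaluation on hypothetical predictions; mailing cost is omitted, as in the figure's profit-per-mail measure, and the data are ours, not DMEF4.

```python
import numpy as np

def average_profit_per_mail(y_true, y_pred, depth):
    """Average actual dollar amount over the top `depth` fraction of
    customers, ranked by predicted dollar amount (descending)."""
    n_mail = max(1, int(round(depth * len(y_true))))
    order = np.argsort(-y_pred)          # best predicted spenders first
    return y_true[order[:n_mail]].mean()

# Hypothetical respondents: actual amounts and noisy model predictions.
rng = np.random.default_rng(3)
y_true = rng.lognormal(mean=3.5, sigma=1.0, size=2000)            # dollar amounts
y_pred = y_true * rng.lognormal(mean=0.0, sigma=0.5, size=2000)   # imperfect model

for depth in (0.05, 0.25, 1.0):
    print(f"depth {depth:.0%}: ${average_profit_per_mail(y_true, y_pred, depth):.2f}")
```

At a mailing depth of 100% the value reduces to the overall mean amount (the analogue of the ~$60 baseline above), while shallower depths give higher averages whenever the model ranks customers well.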
4.2. Training time

Fig. 7 shows the average training times of the three SVR models. SVR-Random took the shortest time, because it was trained on a reduced number of patterns without any pattern selection procedure. SVR-All naturally took the longest, at 1200 s. SVR-ε took significantly less time except when 90% of the patterns were selected. In particular, with 10% of the patterns selected, SVR-ε needed only 16% of
the training time of SVR-All. Even when the percentage of selected patterns was increased to 40%, for which SVR-ε yielded an RMSE only $5 higher than SVR-All, SVR-ε took only one fourth of the training time of SVR-All. SVR-ε was thus far more efficient than SVR-All.

Fig. 7. The training times of the three SVR models.

5. Conclusions and discussion

In response modeling, while identifying respondents is considered important, estimating the purchase amounts of those respondents is just as important. We applied SVR to response modeling in order to find the more profitable customers. In our experiments, the SVR models could select profitable customers among the respondents; such a regression model is useful in situations where the mailing cost is relatively high. However, SVR training takes very long when the training dataset is large. To reduce the training time, the ε-tube based pattern selection method was employed. SVR-ε was more efficient than SVR-All and more accurate than SVR-Random. In practice, the available dataset is almost always large, so some kind of sampling is required to run SVR; the proposed ε-tube based sampling is then highly recommendable.

There are some limitations to this research. First, it is unrealistic and impractical to assume a perfect classifier for respondent identification; further research should use a "real" classification model. Second, SVR-ε was efficient but less accurate than SVR-All, and an improvement in its accuracy would increase its practical utility. Finally, the parameters of the pattern selection method were chosen only empirically; a systematic guideline would be helpful.

Acknowledgement

This work was partially supported by Grant No. R012005-000-103900-0 from the Basic Research Program of the Korea Science and Engineering Foundation, Brain Korea 21, and the Engineering Research Institute of SNU.

References
Bentz, Y., & Merunka, D. (2000). Neural networks and the multinomial logit for brand choice modeling: a hybrid approach. Journal of Forecasting, 19, 177–200.
Cheung, K.-W., Kwok, J. T., Law, M. H., & Tsui, K.-C. (2003). Mining customer product ratings for personalized marketing. Decision Support Systems, 35, 231–243.
Chiu, C. (2002). A case-based customer classification approach for direct marketing. Expert Systems with Applications, 22(2), 163–168.
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems (Vol. 9, pp. 155–161). Cambridge, MA: MIT Press.
Gönül, F. F., Kim, B. D., & Shi, M. (2000). Mailing smarter to catalog customers. Journal of Interactive Marketing, 14(2), 2–16.
Ha, K., Cho, S., & MacLachlan, D. (2005). Response models based on bagging neural networks. Journal of Interactive Marketing, 19(1), 17–30.
Haughton, D., & Oulabi, S. (1997). Direct marketing modeling with CART and CHAID. Journal of Direct Marketing, 11(4), 42–52.
KDD98, The KDD-CUP-98 result. (1998). http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html.
Kim, D., & Cho, S. (2006). ε-tube based pattern selection for support vector machines. Lecture Notes in Artificial Intelligence, 3918, 215–224.
Ling, C. X., & Li, C. (1998). Data mining for direct marketing: problems and solutions. In Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD-98) (pp. 73–79). New York.
Malthouse, E. C. (2002). Performance-based variable selection for scoring models. Journal of Interactive Marketing, 16(4), 37–50.
Müller, K.-R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. Lecture Notes in Computer Science, 1327, 999–1004.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Potharst, R., Kaymak, U., & Pijls, W. (2000). Neural networks for target selection in direct marketing. Erasmus Research Institute of Management (ERIM), RSM Erasmus University, Research Paper ERS-200114-LIS. http://ideas.repec.org/s/dgr/eureri.html.
Shin, H., & Cho, S. (2003). Fast pattern selection algorithm for support vector classifiers: time complexity analysis. Lecture Notes in Computer Science, 2690, 1008–1015.
Shin, H., & Cho, S. (2006). Response modeling with support vector machines. Expert Systems with Applications, 30(4), 746–760.
Suh, E. H., Noh, K. C., & Suh, C. K. (1999). Customer list segmentation using the combined response model. Expert Systems with Applications, 17(2), 89–97.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Viaene, S., Baesens, B., Van Gestel, T., Suykens, J. A. K., Van den Poel, D., Vanthienen, J., et al. (2001). Knowledge discovery in a direct marketing case using least squares support vector machines. International Journal of Intelligent Systems, 16, 1023–1036.
Wang, K., Zhou, S., Yang, Q., & Yeung, J. M. S. (2005). Mining customer value: from association rules to direct marketing. Data Mining and Knowledge Discovery, 11, 57–79.
Yu, E., & Cho, S. (2006). Constructing response model using ensemble based on feature subset selection. Expert Systems with Applications, 30(2), 352–360.
Zahavi, J., & Levin, N. (1997). Applying neural computing to target marketing. Journal of Direct Marketing, 11(4), 76–93.