An encoder–decoder switch network for purchase prediction


Knowledge-Based Systems xxx (xxxx) xxx Contents lists available at ScienceDirect Knowledge-Based Systems journal homepage: www.elsevier.com/locate/k...



Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys

An encoder–decoder switch network for purchase prediction✩

Chanyoung Park a, Donghyun Kim b, Hwanjo Yu c

a Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA
b Yahoo Research, CA, USA
c Department of Computer Science and Engineering, POSTECH, Pohang, South Korea

Article info

Article history: Received 18 April 2019; Received in revised form 7 August 2019; Accepted 9 August 2019; Available online xxxx

Keywords: Recommender system; Purchase prediction; Sequential prediction

Abstract

Users in e-commerce tend to click on items of their interest. Eventually, the more frequently an item is clicked by a user, the more likely the item will be purchased by the user after all. However, what if a user clicked on every item only once before purchases? This is a frequently observed user behavior in reality, but predicting which of the clicked items will be purchased is a challenging task. This paper addresses a practical yet widely overlooked task of predicting purchase items within a non-duplicate click session, i.e., a session in which every item is clicked only once. We propose an encoder–decoder neural architecture to simultaneously model users’ click and purchase behaviors. The encoder captures a user’s intent contained in the user’s click session, and the decoder, which is equipped with pointer network via a switch gate, extracts relevant clicked items for future purchase candidates. To the best of our knowledge, our work is the first to address the task of purchase prediction given non-duplicate click sessions. Experiments demonstrate that our proposed method outperforms the state-of-the-art purchase prediction methods by up to 18% in terms of recall.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

When users shop online, they usually click through multiple items before making the final decision to purchase. Even if they merely clicked on items today without making any purchase, they usually come back later to reconsider and purchase the items that they had clicked in the past. As a concrete example, we discovered from our data analyses that about 72% of Taobao¹ and 42% of Tmall² users come back later to purchase the items that they had clicked in the past. In this regard, showing a user in advance the items that the user is likely to purchase among his/her previously clicked items will assist the user’s decision to purchase.

However, although extracting relevant items from a user’s click session is a noteworthy task, it has garnered little attention from both academia and industry. We conjecture that the reason is that purchased items tend to receive more clicks than non-purchased items, and thus we can readily infer purchased items among clicked items. Our data analysis shows that we can already obtain 68% recall³ in predicting purchase items merely by regarding the most frequently clicked item as the purchased item without training any model, which shows how trivial this task is.

On the other hand, predicting purchase items within a non-duplicate click session, i.e., a session in which every item is clicked only once, is a practical yet widely overlooked task. For example, a user might examine multiple items by simultaneously opening multiple tabs in a browser rather than revisiting them back and forth, in which case each item receives only one click. In other words, to compare multiple items before making the final purchase, it is common to open multiple tabs in a browser to look at multiple items at a glance, instead of going back and forth in a single browser. In this case, the system thinks that each item is clicked only once, although these items were glanced at multiple times. Fig. 1 shows a toy example of ‘‘duplicate’’ and ‘‘non-duplicate’’ click sessions. While it is a trivial task to predict that the most frequently clicked brown sandal is likely to be purchased after all (Fig. 1a), it is challenging to infer which of the clicked items will be purchased if every item is clicked only once as in Fig. 1b. Our data analysis even shows that non-duplicate click sessions account for about 40%⁴ of the sessions on e-commerce platforms such as Yoochoose, an amount that should not be overlooked. Hence, in this paper, we address for the first time a novel and practical task of predicting purchase items within a non-duplicate click session.

Fig. 1. ‘‘Duplicate’’ and ‘‘non-duplicate’’ click sessions.

The core idea of our method is based on the assumption that purchases never happen without clicks in the past, which in turn restricts the items in the purchase sequence to be always a subset of the items in the click sequence. This implies that for each user the candidate items for a purchase do not need to be the entire set of items, but only the items in the click sequence. For example, if a user clicked on items $v_1$, $v_2$, and $v_3$ in a session, we only need to consider these three items as purchase candidates rather than the entire set of items. Existing approaches [1–6] are applicable to our task; however, they consider the set of all possible items as candidates for purchases even though the purchased items are a subset of the clicked items. To be more specific, they consider all items across sessions as candidates for purchases (global relationship), while missing the relationship modeling among clicked items within a session (local relationship). However, we argue that the clicked items within the current session of interest should be emphasized when our goal is to predict purchase items among a click session.

In the light of this issue, we propose PurchaseNet that consists of (1) a click sequence encoder that captures a user’s intent by encoding a sequence of user clicks, and (2) a purchase sequence extractor (decoder) that extracts one or more relevant clicked items for the future purchases by jointly modeling the global and local relationships among items.

The core idea of PurchaseNet is to directly account for the local relationship among items within a session by endowing our decoder with a pointer network [7], which is an attention-based neural network that selects a member of the input sequence as the output. Moreover, we introduce a switch gate into our decoder so that our method can not only (1) focus on the local relationship among clicked items within a session, but also (2) consider the global relationship among all items across sessions. The decoder is also able to generate a purchase sequence in case a user sequentially purchases multiple items within a session. We expect the encoder to capture the user intent contained in a click session so as to help the decoder in extracting a sequence of relevant items based on the encoded intent. Experimental results show that PurchaseNet outperforms the state-of-the-art purchase prediction methods by up to 18% in terms of recall.

Our contributions are as follows:

• We introduce a worthwhile yet overlooked task of predicting purchased items among non-duplicate click history.
• We propose an encoder–decoder neural architecture that is endowed with a pointer network, which is intuitive and appropriate for our task. We introduce a switch gate to regulate the importance of the pointer network and the conventional neural decoder with softmax.
• Experimental results on multiple real-world e-commerce datasets demonstrate that our proposed method, called PurchaseNet, outperforms state-of-the-art purchase prediction methods.

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.104932.
∗ Corresponding author. E-mail addresses: [email protected] (C. Park), [email protected] (D. Kim), [email protected] (H. Yu).
1 https://goo.gl/2yULdD.
2 https://ijcai-15.org/index.php/repeat-buyers-prediction-competition.
3 We filtered out items with less than five interactions and sessions with a single item, and obtained 68.36% of recall.
4 After filtering out the items with less than five interactions and sessions with a single item, 40.09% of the sessions are non-duplicate sessions in the Yoochoose dataset⁵.

https://doi.org/10.1016/j.knosys.2019.104932 0950-7051/© 2019 Elsevier B.V. All rights reserved.

Please cite this article as: C. Park, D. Kim and H. Yu, An encoder–decoder switch network for purchase prediction, Knowledge-Based Systems (2019) 104932, https://doi.org/10.1016/j.knosys.2019.104932.

Table 1
Notations.

Notation                              Description
$S$                                   Set of sessions ($|S| = R$).
$s^k$                                 $k$th session.
$C^k$                                 A sequence of clicks in the $k$th session.
$P^k$                                 A sequence of purchases in the $k$th session.
$I$                                   Set of items.
$m$                                   The number of items.
$d$                                   The size of an item embedding vector.
$K$                                   The number of hidden units.
$M \in \mathbb{R}^{d \times |I|}$     Item embedding matrix.
$h_i \in \mathbb{R}^{K}$              Encoder hidden state at step $i$.
$s_t \in \mathbb{R}^{K}$              Decoder hidden state at step $t$.
$FC(\cdot)$                           A fully-connected layer with a bias unit.
$p(\mathrm{soft})$                    A switch gate to select between $P_{soft}$ and $P_{ptr}$.

2. Our approach

In this section, we describe the components of our proposed method, called PurchaseNet. We first explain the encoder that captures the user intent of the current session by reading a sequence of clicks, and then explain the decoder that extracts the items most likely to be purchased among the clicks read by the encoder.

2.1. Problem formulation

Let $S = \{s^1, s^2, \ldots, s^R\}$ be a set of $R$ sessions, where the $k$th session $s^k = ([C^k], [P^k])$ consists of a sequence of clicks $C^k$ and purchases $P^k$ on items $v_i \in I$, where $I$ is the set of $m$ items. $C^k$ is a non-duplicate click session if the number of unique items in $C^k$ equals the length of $C^k$. For example, $C^1 = [v_1, v_2, v_3]$ is a non-duplicate click session as the number of unique items in $C^1$, i.e., 3, is equal to the length of $C^1$, i.e., 3. On the other hand, $C^2 = [v_1, v_2, v_1]$ is a duplicate click session as the number of unique items in $C^2$ is 2, whereas the length of $C^2$ is 3. We assume that a purchase of a certain item never occurs without prior clicks on the item. For example, $s = ([v_2, v_4, v_5], [v_4])$ is a valid session where $v_4$ is purchased after sequentially clicking $v_2$, $v_4$ and $v_5$, whereas $s = ([v_2, v_4, v_5], [v_1])$ is not valid because $v_1$ is not clicked before. Note that we consider the entire click sequence within a session. For example, given a session $s = ([v_2, v_4, v_5], [v_4])$, the following can also be considered as valid: $([v_4], [v_4])$, $([v_2, v_4], [v_4])$, and $([v_4, v_5], [v_4])$.
However, we only consider $([v_2, v_4, v_5], [v_4])$ as a valid session for training because it consists of the entire click sequence. Table 1 summarizes the notations used throughout this paper. In short, we aim to predict which of the clicked items within a non-duplicate click session will be purchased in the future.

2.2. RNN-based encoder

The encoder is a bidirectional recurrent neural network (RNN), i.e., $\mathrm{RNN}_{enc}(\cdot)$, that reads the input click sequence $[c_1, c_2, \ldots, c_n]$, where $c_i \in \mathbb{R}$ ($1 \le i \le n$) is the index of the $i$th clicked item.


Fig. 2. The overall architecture of PurchaseNet.

Before feeding the input sequence into the encoder, we convert it into the input embedding matrix $E \in \mathbb{R}^{d \times n}$ by using the item embedding matrix $M \in \mathbb{R}^{d \times |I|}$, where $d$ is the embedding size and $E(:, i) = M(:, c_i)$. Next, $E$ is fed into the forward RNN, which creates a sequence of hidden states $[\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_n]$, where $\overrightarrow{h}_i = \overrightarrow{\mathrm{RNN}}_{enc}(c_i, \overrightarrow{h}_{i-1})$ and $\overrightarrow{h}_i \in \mathbb{R}^{K}$. In addition, the reversed $E$ is fed into the backward RNN, which creates $[\overleftarrow{h}_n, \overleftarrow{h}_{n-1}, \ldots, \overleftarrow{h}_1]$, where $\overleftarrow{h}_i = \overleftarrow{\mathrm{RNN}}_{enc}(c_i, \overleftarrow{h}_{i+1})$. Note that we used GRU for $\mathrm{RNN}(\cdot)$ in this work [8]. Finally, the forward and backward hidden states are concatenated to create the encoder hidden states $[h_1, h_2, \ldots, h_n]$, where $h_i = \mathrm{RNN}_{enc}(c_i, h_{i-1}, h_{i+1}) = [\overrightarrow{h}_i; \overleftarrow{h}_i] \in \mathbb{R}^{2K}$; $[\cdot;\cdot]$ denotes the concatenation operation. We expect $h_i$ to capture the intent of the user until the $i$th clicked item of a session.

2.3. Decoder with attention

The decoder is a unidirectional RNN, whose goal is to predict the purchase probability distribution over $m$ items at each timestep $t$. To begin with, the decoder initializes its hidden state $q_0 \in \mathbb{R}^{K}$ by using the last hidden state of the encoder, i.e., $h_n$: $q_0 = W_{init}^{T} h_n$, where $W_{init} \in \mathbb{R}^{2K \times K}$. The decoder hidden state at timestep $t$ is computed based on the previous decoder hidden state $q_{t-1}$ and the previously predicted purchase item $p_{t-1}$, where $p_{t-1}$ is the index of the previously purchased item: $q_t = \mathrm{RNN}_{dec}(q_{t-1}, M(:, p_{t-1}))$. Note that we share the item embedding matrix $M$ between the encoder and the decoder, because the entire network is dealing with the same set of $m$ items.

Next, to predict a purchase item at each decoder timestep $t$ based on the clicked items, we adopt the attention mechanism [9] that tells the decoder which of the clicked items within the input click sequence to look at. The attention distribution $a_t = [a_{t,1}, a_{t,2}, \ldots, a_{t,n}]$ is calculated as:

$$a_{t,i} = \frac{\exp(e_{t,i})}{\sum_{l=1}^{n} \exp(e_{t,l})} \tag{1}$$

where $e_{t,i} = v^{T} \tanh(W^{T} [h_i; q_t])$, $W \in \mathbb{R}^{3K \times K}$ and $v \in \mathbb{R}^{K}$. $a_{t,i}$ can be considered as the importance of the $i$th clicked item in predicting the purchase item at decoder timestep $t$. Then, the attention distribution is used to produce a context vector $ctx_t$, which is a weighted sum of the encoder hidden states: $ctx_t = \sum_{i=1}^{n} a_{t,i} h_i$. Finally, the decoder hidden state $q_t$ is concatenated with the context vector $ctx_t$, and passed through two fully-connected layers followed by a softmax layer to produce the purchase probability distribution over $m$ items:

$$P_{soft} = \mathrm{softmax}(FC_2(FC_1([ctx_t; q_t]))) \in \mathbb{R}^{m} \tag{2}$$

where $FC_1(\cdot)$ and $FC_2(\cdot)$ are fully-connected layers with bias units. The probability of item $v_j$ being purchased is:

$$P_{purchase}(v_j) = P_{soft}(v_j) \in \mathbb{R} \tag{3}$$

where $P_{*}(v_j)$ is the $j$th element of the vector $P_{*}$. $P_{soft}$ is a probability distribution vector in which the $k$th dimension denotes the probability of item $k$ being purchased ($1 \le k \le m$). As we only need to calculate the purchase probability of the items that had been previously clicked in the current session, we extract the probabilities for the $n$ items that had been previously clicked among the $m$ items. Although the above method is very straightforward, we argue that it is inefficient to calculate the entire $m$-dimensional vector $P_{soft}$ in the first place, when we only require the probabilities for the $n$ clicked items in the current session ($n \ll m$). Hence, we propose to adopt the pointer network, which is more suitable for our task.

2.4. Pointer network

Recall that a pointer network [7] is an attention-based neural network that learns the conditional probability of an output whose values correspond to the positions in an input sequence. We argue that the pointer network is a suitable architecture for our task, because in our task the number of candidate items for purchase, i.e., $n$, is very small. Therefore, calculating the purchase probabilities of the entire $m$ items as in Eq. (2) is both inefficient in terms of computational cost, and ineffective in terms of training as the focus of the model can be distracted by the non-clicked candidate items included in the denominator of the softmax term. By employing the pointer network, we only need to compute the attention distribution over the input click sequence as in Eq. (1). Therefore, the probability of item $v_j$ being purchased (i.e., $P_{ptr}(v_j)$) at decoder timestep $t$ is computed as follows:

$$P_{purchase}(v_j) = P_{ptr}(v_j) = \mathbb{I}[c_i = v_j] \cdot a_{t,i} \in \mathbb{R} \tag{4}$$


where ci is the index of the ith clicked item in the input click sequence, and I[·] is an indicator function that outputs 1 if the condition is true, otherwise 0. In other words, the probability of purchasing item vj is equivalent to the attention weight given to the position of vj in the input click sequence. As a concrete example, if item vj was clicked in the third position, then Pptr (vj ) = at ,3 . By computing the purchase probability as in Eq. (4) instead of Eq. (3), we not only reduce the computational cost, but also gain greater focus on the items that appear in the input click sequence rather than the entire set of m items.
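As a rough illustration of Eqs. (1) and (4), the sketch below turns attention scores over the click positions into pointer-style purchase probabilities. It is only a minimal sketch under stated assumptions: the function names `softmax` and `pointer_probabilities` are hypothetical, and the scores $e_{t,i}$ are passed in as plain numbers rather than computed from learned encoder and decoder states.

```python
import math


def softmax(scores):
    """Eq. (1): normalize attention scores e_{t,i} into weights a_{t,i}."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_probabilities(clicks, scores):
    """Eq. (4): P_ptr(v_j) is the attention weight at the position of v_j.

    Assumes a non-duplicate click session, so every clicked item maps to
    exactly one position. Items absent from the result have probability 0.
    """
    if len(clicks) != len(scores):
        raise ValueError("one attention score is needed per clicked item")
    if len(set(clicks)) != len(clicks):
        raise ValueError("expected a non-duplicate click session")
    return dict(zip(clicks, softmax(scores)))


# Hypothetical scores for the session [v2, v4, v5] at one decoder timestep.
probs = pointer_probabilities(["v2", "v4", "v5"], [0.1, 2.0, 0.3])
best_item = max(probs, key=probs.get)
```

Because the distribution is defined only over the $n$ clicked items, no computation over the remaining $m - n$ items is needed, which is the efficiency argument made above.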

2.5. Proposed method: PurchaseNet

Although the pointer network improves the model efficiency and focuses solely on the input click sequence, we argue that it fails to model the global relationship among all items across the sessions, because it uses the attention weights on the input sequence as the final predictions, which overlooks the items outside the session. Recall that, unlike the pointer network, the conventional neural network with softmax considers the interplay among the entire set of $m$ items at each decoder timestep at the expense of the computational burden (Eq. (2)). More precisely, the softmax part of Eq. (2) contains the entire $m$ items in the denominator, and thus $P_{soft}$ considers the global relationship among the entire set of $m$ items for predictions, in contrast to $P_{ptr}$, which only considers the local relationship among the items within an input click sequence. To achieve the best of both worlds, we propose to combine $P_{soft}$ in Eq. (3) and $P_{ptr}$ in Eq. (4) by means of a switch gate that decides whether to use $P_{soft}$ or $P_{ptr}$ for the final prediction. The switch gate $S(\mathrm{soft})$ is defined as follows:

$$S(\mathrm{soft}) = \sigma(FC_{switch}([ctx_t; q_t; M(:, p_{t-1})])) \tag{5}$$

where $FC_{switch}(\cdot)$ is a fully-connected layer, and $\sigma(\cdot)$ is a sigmoid function. The rationale behind the design of $S(\mathrm{soft})$ is to increase the probability of using softmax if the context vector $ctx_t$ aligns well with the current decoder hidden state $q_t$ and the previously purchased item $p_{t-1}$, which indicates that $ctx_t$ plays an important role in predicting the next purchased item. Hence, the final prediction probability of item $v_j$ being purchased is defined as:

$$P_{purchase}(v_j) = S(\mathrm{soft}) P_{soft}(v_j) + (1 - S(\mathrm{soft})) P_{ptr}(v_j) \tag{6}$$

We note that the switch gate can be either a binary switch that makes a binary decision between $P_{soft}$ and $P_{ptr}$, or a probabilistic switch that blends them probabilistically.

Algorithm 1: PurchaseNet algorithm
  Input: A set of sessions S, batch size b, embedding dimension d
  Output: Learned model parameters for forward RNN_enc, backward RNN_enc, RNN_dec, FC_1, FC_2 and learned weight parameters M, W_init, W
  Initialize trainable parameters
  while not converged do
    foreach session s = ([C], [P]) ∈ S do
      Embed C = [c_1, c_2, ..., c_n] using M to obtain E ∈ R^{d×n}
      H, h_0 ← RNN_enc(E)          // Encode
      P_purchase ← RNN_dec(H, h_0)  // Decode
      Update model parameters and weights using L in Eq. (7)
    end
  end

Algorithm 2: Encoder
  Function RNN_enc(E):
    Input: Embedded click sequence E
    Output: Encoder hidden states H, encoder last hidden state H(:, −1)
    H_fwd ← forward RNN_enc(E)     // Forward RNN
    H_bwd ← backward RNN_enc(E)    // Backward RNN
    H ← [H_fwd; H_bwd]             // Bidirectional RNN (concatenation, as in Section 2.2)
    return H, H(:, −1)             // H(:, −1) is the last column of H

Algorithm 3: Decoder
  Function RNN_dec(H, h_init):
    Input: Encoder hidden states H, initial decoder input h_0, initial decoder hidden state q_0
    Output: Decoder output P_purchase ∈ R^{|I|}
    p_0 ← <START>                  // p_t: item purchased at timestep t
    for t ← 1 to n do
      q_t ← RNN(q_{t−1}, M(:, p_{t−1}))
      ctx_t, a_t ← Attn(q_t, H)
      P_soft ← Softmax(FC_2(FC_1([ctx_t; q_t])))
      foreach item v_j ∈ I do
        P_ptr(v_j) ← I[c_i = v_j] · a_{t,i}
      end
      S(soft) ← σ(FC_switch([ctx_t; q_t; M(:, p_{t−1})]))
      foreach item v_j ∈ I do
        P_purchase(v_j) ← S(soft) P_soft(v_j) + (1 − S(soft)) P_ptr(v_j)
      end
    end
    return P_purchase

Given the final predicted purchase probability for session $s^k \in S$ as $P^{s^k}_{purchase}(v)$, the final objective function of PurchaseNet is given as follows:

$$L = \sum_{k=1}^{R} \sum_{P^k \in s^k} \sum_{v \in P^k} y^{s^k}_{j} P^{s^k}_{purchase}(v) + (1 - y^{s^k}_{j})(1 - P^{s^k}_{purchase}(v)) \tag{7}$$
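The switch-gate combination in Eq. (6) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the names `blend` and `binary_gate` are hypothetical, the gate value is supplied directly instead of being produced by $FC_{switch}$, and the 0.5 threshold used for the binary switch is an assumption, since the text does not state how the binary decision is made.

```python
def blend(p_soft, p_ptr, gate):
    """Eq. (6): P_purchase = gate * P_soft + (1 - gate) * P_ptr.

    p_soft maps every item to its softmax probability; p_ptr maps only the
    clicked items to their attention weights (other items count as 0).
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    items = set(p_soft) | set(p_ptr)
    return {
        v: gate * p_soft.get(v, 0.0) + (1.0 - gate) * p_ptr.get(v, 0.0)
        for v in items
    }


def binary_gate(gate, threshold=0.5):
    """Binary switch: pick P_soft or P_ptr outright (threshold is assumed)."""
    return 1.0 if gate >= threshold else 0.0


# Hypothetical distributions over three items, two of which were clicked.
p_soft = {"v1": 0.5, "v2": 0.3, "v4": 0.2}
p_ptr = {"v2": 0.25, "v4": 0.75}
probabilistic = blend(p_soft, p_ptr, 0.4)
binary = blend(p_soft, p_ptr, binary_gate(0.4))
```

With the probabilistic switch, the non-clicked item `v1` keeps some mass from $P_{soft}$; with the binary switch at the same gate value, the output collapses to $P_{ptr}$.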

We provide pseudocode of PurchaseNet in Algorithms 1, 2, and 3.

2.6. Discussion

In case a user sequentially purchases multiple items within a session, one could argue that instead of decoding a sequence of purchased items as shown in Fig. 2, we could predict the output at once similar to multi-class classification, i.e., one-shot decoding. In other words, we could perform the decoding only one step, and use the output of the first timestep of the decoder as the final prediction, which is an $m$-dimensional vector. Then, we could backpropagate the loss calculated between this output vector and a label vector whose elements are 1 if the corresponding items are purchased, or otherwise 0. However, we argue that as purchasing an item may affect the next purchasing item, decoding a sequence of purchases is a more intuitive approach. We show in our experiments that our sequence decoding scheme outperforms the one-shot decoding approach.

2.7. Complexity analysis

The overall time complexity of PurchaseNet is mainly composed of the computations of (1) encoder $\mathrm{RNN}_{enc}(\cdot)$, (2) decoder $\mathrm{RNN}_{dec}(\cdot)$, and (3) the attention mechanism while decoding


shown in Eq. (1). To be specific, let $|C|$ and $|P|$ denote the lengths of the click and purchase sequences, respectively. Then, the time complexity of PurchaseNet is $O(|C| + |P| + |C||P|) = O(|C||P|)$, where $O(|C||P|)$ is the complexity of the attention mechanism in Eq. (1). On the other hand, the complexity of the most relevant baseline, i.e., GRU4REC, depends only on the length of the click sequences, i.e., $O(|C|)$, because GRU4REC only performs the encoding step. In summary, PurchaseNet is efficient, because the purchase sequences are far shorter than the click sequences, i.e., $|P| \ll |C|$, as shown in Table 2. Moreover, it is worth mentioning that the complexity of the attention mechanism adopted in this paper can be improved by recently proposed techniques [10,11].


3. Experiments

The experiments are designed to answer the following research questions (RQs):

RQ1. How does PurchaseNet perform compared with the state-of-the-art purchase prediction methods?
RQ2. What type of switch performs better? (Binary vs. probabilistic)
RQ3. Does decoding a sequence of purchases outperform the one-shot decoding in purchase prediction?
RQ4. How do the hyperparameters of PurchaseNet affect the prediction accuracy?

3.1. Dataset

We evaluate our proposed method on three real-world datasets: Yoochoose,⁵ Xing⁶ and Taobao.⁷ The Yoochoose and Taobao datasets are public e-commerce datasets, each of which contains both click and purchase records of user sessions. To demonstrate that our proposed method is a general framework that can be applied to similar tasks in which the output depends on the input, we additionally used a dataset published for a job recommendation challenge, where clicks and bookmarks of users are provided. As bookmarks must happen after clicks, we assumed that a bookmark is analogous to a purchase in our e-commerce setting. For datasets with user identification, i.e., Taobao and Xing, we regarded a user’s actions taken on a single day as an independent session. For all datasets, to filter out noisy samples, we removed users and items with fewer than 5 interactions, removed sessions with no purchase, and retained only non-duplicate sessions of length greater than or equal to 10. The statistics of the preprocessed datasets used in our experiments are summarized in Table 2.

5 http://2015.recsyschallenge.com/challenge.html.
6 http://2016.recsyschallenge.com/.
7 https://goo.gl/8FwWUu.

3.2. Methods compared

• ItemKNN: An item-based CF method that predicts purchases based on item co-occurrence. We compute the item co-occurrence matrix within each session, and combine them globally across the entire sessions. We extract the clicked items with the highest co-occurrence values in each session for prediction.
• MLP: A multi-layer perceptron with three layers that, given a multi-hot click vector of a session as input, computes the probability of the purchased items in the current session.
• CDAE [2]: The state-of-the-art autoencoder-based method whose objective is to reconstruct the input, where the input is a multi-hot vector of clicked items. To compare CDAE with our method, we predict item purchases from the intermediate hidden layer of the autoencoder.
• GRU4REC [1]: A state-of-the-art session-based next click prediction method based on GRU. As its goal (click prediction) differs from ours (purchase prediction), we added a fully connected layer on top of the last hidden state of GRU4REC for predicting the purchased items.

We also tried a popularity-based method that outputs the most popular items throughout the entire dataset. However, as its performance was very poor, we excluded it from the paper.

3.3. Evaluation protocol and metrics

To evaluate our newly defined task of purchase prediction among non-duplicate click sessions, we divide the datasets into training, validation, and test sets in an 80%/10%/10% split in terms of the sessions. That is, each session can only belong to one of the training, validation or test sets. After training our model session by session, for each session we extract top-$N$ items for predictions. Recall that, following our initial assumption, the candidate items for purchases are the items that were clicked in each session, not the total set of items. For example, if there are 10 items in the click sequence of a session, then the candidates for purchases are those 10 items in the click sequence. For this reason, we cannot fix $N$ as conventional ranking methods do, because the number of clicked items in a session can be smaller than a fixed $N$. Instead, we extract top-$N$ items for predictions, where $N$ is the number of the ground truth purchased items in each session; thus, $N$ may vary among sessions. For example, given a session with 10 clicked items and 2 purchased items, we examine the scores only for the 10 clicked items, and extract the top-2 amongst them.

Since our task is a ranking problem, we compute recall and mean reciprocal rank (MRR) between the $N$ predicted items and the $N$ ground truth purchases; recall is equivalent to precision because the number of predictions is equal to the number of purchased items in each session. Recall computes how many of the predicted items are actually in the ground truth, and MRR is a position-aware metric that adds all the reciprocal ranks of the predicted items that appear in the ground truth. Recall and MRR are defined as follows:

$$\mathrm{Recall} = \frac{1}{Z} \sum_{i=1}^{R} \sum_{v \in P^i} \mathbb{I}[v \in \mathrm{pred}(i)] \tag{8}$$

$$\mathrm{MRR} = \frac{1}{Z} \sum_{i=1}^{R} \sum_{v \in P^i} \frac{1}{\mathrm{rank}_i(v)} \tag{9}$$

where $\mathrm{rank}_i(v)$ is the rank of item $v$ among the predictions for session $i$, $\mathrm{pred}(i)$ is the set of predicted items for session $i$, and $Z$ is the total number of purchased items in the dataset, i.e., $Z = \sum_{i=1}^{R} |P^i|$. Note that we do not consider metrics such as root mean squared error (RMSE) and mean absolute error (MAE) because we are interested in the ranking of each predicted item, rather than the absolute likelihood probability for each predicted item. In other words, we are interested in the relative ranking among the candidate items, which cannot be expressed by error measures like RMSE and MAE.
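The evaluation protocol above can be sketched as follows. This is a minimal sketch rather than the evaluation code used in the paper: the function name `recall_and_mrr` and the input layout are hypothetical, and it assumes that a purchased item ranked outside the top-$N$ contributes zero to both metrics, which is one reading of Eqs. (8) and (9).

```python
def recall_and_mrr(sessions):
    """Compute Recall (Eq. 8) and MRR (Eq. 9) with a per-session N.

    sessions: list of (scores, purchased) pairs, where scores maps each
    clicked item to its predicted score and purchased lists the ground
    truth purchases of that session. N equals len(purchased).
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    total_purchases = 0  # Z in Eqs. (8) and (9)
    for scores, purchased in sessions:
        # Rank only the clicked items and keep the top-N of them.
        ranked = sorted(scores, key=scores.get, reverse=True)
        top_n = ranked[:len(purchased)]
        for item in purchased:
            total_purchases += 1
            if item in top_n:
                hits += 1
                reciprocal_rank_sum += 1.0 / (top_n.index(item) + 1)
    if total_purchases == 0:
        raise ValueError("no purchases to evaluate")
    return hits / total_purchases, reciprocal_rank_sum / total_purchases


# Two hypothetical sessions: the first is predicted perfectly, the second is missed.
example = [
    ({"a": 0.9, "b": 0.5, "c": 0.1}, ["b", "a"]),
    ({"x": 0.2, "y": 0.7, "z": 0.1}, ["x"]),
]
recall, mrr = recall_and_mrr(example)
```

Since each session contributes exactly as many predictions as it has purchases, the recall returned here coincides with precision, as noted in the text.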


Table 2
Dataset statistics.

Dataset     Type       #sessions   #items   #actions    #actions/session (sequence length)
Yoochoose   Click      25,400      17,071   386,771     15.22
            Purchase               7,881    95,650      3.76
Taobao      Click      110,521     80,382   3,314,298   21.41
            Purchase               44,488   143,123     2.31
Xing        Click      9,595       29,034   205,513     29.98
            Bookmark               12,786   22,192      1.29
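The session-level filtering described in Section 3.1 can be sketched as below. The helper names `is_non_duplicate` and `keep_session` are hypothetical, the user and item frequency filter (fewer than 5 interactions) is not covered, and the check that purchases are a subset of clicks reflects the assumption stated in Section 2.1 rather than a preprocessing step the paper lists explicitly.

```python
def is_non_duplicate(clicks):
    """A click session is non-duplicate if every item is clicked exactly once."""
    return len(set(clicks)) == len(clicks)


def keep_session(clicks, purchases, min_length=10):
    """Session-level filter sketched from Section 3.1.

    Keeps sessions that have at least one purchase, are non-duplicate,
    contain at least min_length clicks, and (assumption from Section 2.1)
    only purchase items that were clicked in the same session.
    """
    return (
        len(purchases) > 0
        and is_non_duplicate(clicks)
        and len(clicks) >= min_length
        and set(purchases) <= set(clicks)
    )


# The two toy sessions from Section 2.1.
non_duplicate_example = is_non_duplicate(["v1", "v2", "v3"])  # True
duplicate_example = is_non_duplicate(["v1", "v2", "v1"])      # False
```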

Table 3
Best performing learning rate (lRate) and optimizer of each compared method on the three datasets.

              Yoochoose             Taobao                Xing
Method        lRate    Optimizer    lRate    Optimizer    lRate    Optimizer
MLP           0.001    Adam         0.01     Adagrad      0.01     Adam
CDAE          0.001    Adam         0.0001   Adam         0.1      Adagrad
GRU4REC       0.1      Adam         0.0001   Adam         0.0001   Adam
Psoft         0.001    Adam         0.0001   Adagrad      0.001    Adagrad
Pptr          0.01     Adagrad      0.001    Adam         0.001    Adagrad
PurchaseNet   0.0001   Adam         0.001    Adam         0.0001   Adam

Table 4
Test performance of different methods. (Imp. denotes the improvement of PurchaseNet over the best competitor.)

Dataset     Metric   Random   Recent   ItemKNN   MLP      CDAE     GRU4REC   Psoft    Pptr     PurchaseNet   Imp.
Yoochoose   Recall   0.2310   0.1730   0.3420    0.4020   0.3742   0.4103    0.4410   0.4799   0.5512        14.86%
            MRR      0.1240   0.1150   0.1496    0.1864   0.1676   0.1936    0.2156   0.2494   0.2894        16.04%
Taobao      Recall   0.0720   0.1560   0.0647    0.1708   0.1780   0.1760    0.1925   0.2012   0.2426        10.09%
            MRR      0.0640   0.1020   0.0497    0.1407   0.1485   0.1478    0.1617   0.1667   0.2080        18.37%
Xing        Recall   0.2170   0.1740   0.2072    0.2367   0.2358   0.2540    0.2549   0.2521   0.3025        18.67%
            MRR      0.1260   0.1020   0.0989    0.1170   0.1184   0.1298    0.1339   0.1231   0.1523        13.74%

3.4. Implementation detail We implemented PurchaseNet with PyTorch [12]. As the best performing optimizers vary across datasets and methods, we experimented with two different optimizers, Adam [13] and Adagrad [14] both with default hyperparameters, and tuned the learning rate from {0.0001, 0.001, 0.01, 0.1}. We used the optimizer and the learning rate that performed the best on the validation dataset, and reported the results on the test dataset (see Table 3). Moreover, we used the mini-batch size of 30, and teacher forcing ratio of 0.5 while decoding for better generalization [15]. For all the compared methods, the size of the item embedding and the number of hidden units are set to 150 and 200, respectively. For reliability, we repeat our evaluations five times with different random seeds for the model initialization, and we report the mean test scores. 3.5. Performance analysis 3.5.1. Performance of purchase prediction (RQ1) Table 4 shows the performance of different methods in terms of Recall and MRR. We have the following observations from Table 4. (1) Pptr generally outperforms Psoft , which implies that focusing on the local relationship among clicked items within a session rather than globally considering the entire item set is more important for purchase prediction among non-duplicate click session. (2) Our ultimate proposed method, PurchaseNet, consistently outperforms the baseline methods including Psoft and Pptr . More precisely, while Psoft and Pptr show competitive performance compared with other baselines, the performance considerably improves when jointly combining them to form PurchaseNet. This verifies the benefit of jointly modeling the global relationship among clicked items across sessions, and the local relationship among clicked items within each session. (3) Three

Table 5
Switches of PurchaseNet.

Dataset     Metric    Binary switch    Prob. switch
Yoochoose   Recall    0.5138           0.5510
            MRR       0.2678           0.2889
Taobao      Recall    0.2185           0.2426
            MRR       0.1868           0.2080
Xing        Recall    0.2969           0.3025
            MRR       0.1471           0.1523

neural network-based methods, i.e., MLP, CDAE and GRU4REC, consistently outperform the traditional baselines, and GRU4REC in particular performs the best among them. This demonstrates that RNNs are suitable for sequence modeling, which justifies the adoption of RNNs in our encoder and decoder. (4) The superior performance of PurchaseNet on the Xing dataset indicates that our proposed method can be regarded as a general framework applicable to prediction tasks in which the output is selected from the input. (5) Although the absolute scores still leave room for improvement, the task is inherently challenging and PurchaseNet outperforms several state-of-the-art competitors; we therefore believe that our study paves the way for further work on this practical task.

3.5.2. Comparisons between different types of switches (RQ2)

Recall that we introduced a switch gate S(soft) to jointly combine Psoft and Pptr. Table 5 compares the binary switch and the probabilistic switch applied to PurchaseNet. We observe that the probabilistic switch consistently outperforms the binary switch, which demonstrates the benefit of the probabilistic modeling. It is worth noting that

Please cite this article as: C. Park, D. Kim and H. Yu, An encoder–decoder switch network for purchase prediction, Knowledge-Based Systems (2019) 104932, https://doi.org/10.1016/j.knosys.2019.104932.


Table 6
Comparisons of different decoding schemes of PurchaseNet.

                      Psoft                 Pptr                  PurchaseNet
Dataset     Metric    One-shot   Sequence   One-shot   Sequence   One-shot   Sequence
Yoochoose   Recall    0.4199     0.4410     0.4555     0.4799     0.4528     0.5510
            MRR       0.1971     0.2156     0.2217     0.2494     0.2236     0.2889
Taobao      Recall    0.1966     0.1925     0.2185     0.2012     0.2158     0.2426
            MRR       0.1635     0.1617     0.1758     0.1667     0.1728     0.2080
Xing        Recall    0.2479     0.2549     0.2502     0.2521     0.2600     0.3025
            MRR       0.1187     0.1339     0.1187     0.1231     0.1281     0.1523

even PurchaseNet with the less effective binary switch outperforms Psoft and Pptr. This again demonstrates the benefit of introducing a switch gate for the joint modeling of the global and local item relationships.

3.5.3. Benefit of sequence decoding (vs. one-shot decoding) (RQ3)

We compare the performance of the different decoding schemes (Table 6) to corroborate our discussion in Section 2.5. We observe that decoding a sequence of purchased items outperforms one-shot decoding in most cases, and always does so for PurchaseNet. From these results, we argue that the order in which purchases are made should not be overlooked even when decoding. To the best of our knowledge, our work is the first to deal with the decoding scheme of purchases for purchase prediction.

3.5.4. Hyperparameter analysis (RQ4)

PurchaseNet contains several hyperparameters to be tuned, such as the size of the item embedding (d), the number of hidden units (K) of the encoder and decoder, the learning rate, the batch size (b), and the teacher forcing ratio. In this section, we run experiments over various values of the aforementioned hyperparameters on the Yoochoose and Xing datasets, and show the results in terms of recall and MRR in Figs. 3 and 5. We make the following observations: (1) Choosing an appropriate learning rate is paramount for the model performance. (2) On the other hand, the model performance does not depend much on the remaining hyperparameters, i.e., d, K, b and the teacher forcing ratio, which alleviates the burden of hyperparameter tuning for PurchaseNet. (3) Applying too much or too little teacher forcing while decoding is harmful to the performance of PurchaseNet, and thus an appropriate teacher forcing ratio, which differs among datasets, should be found. It is worthwhile to note that the predefined setting of the hyperparameters introduced in Section 3.4 does not match the best performing hyperparameter values shown in Figs. 3, 4, and 5.
This is because tuning all four remaining hyperparameters (d, K, b, and the teacher forcing ratio) drastically increases the hyperparameter search space, which would make our proposed method impractical. Therefore, our intention was to set these hyperparameters to moderate values so as to show that our method does not depend too much on the choice of the hyperparameters. For this reason, the best performing values in Figs. 3, 4, and 5 differ from the values that we introduced in Section 3.4.

4. Related work

4.1. Modeling user purchasing behavior in e-commerce

With the advent of e-commerce, predicting users’ future behaviors has become vital, and thus has been actively researched [1–5,16,17]. Rendle et al. [3] proposed a prediction model that ranks unobserved items for recommendation under the assumption that users prefer purchased items to non-purchased items. Hidasi et al. [1] proposed an RNN-based architecture that models users’ click sessions with the aim of predicting the next clicks.


Wu et al. [2] proposed a denoising autoencoder that reconstructs a user’s behavior to obtain the user’s hidden representation, which is then used to predict the user’s future behavior. However, the goal of the aforementioned methods is inherently different from ours in that they predict future purchases (clicks) given previous purchase (click) records, whereas we predict purchases given previous non-duplicate click records. As our work is, to the best of our knowledge, the first to address this task, we compare PurchaseNet with several recently proposed purchase prediction methods [1,2] in the experiments (see Section 3.2 for details).

As another line of research for user behavior prediction, various side information related to users and items, such as user social networks [18,19], item images [20,21], review text [22–25], search queries [26], and heterogeneous user behaviors [27,28], has been leveraged. More precisely, a method introduced by Zhao et al. assigned higher ranks to the items that a user’s friends prefer than to the items that neither he nor his friends prefer [18]. This work was extended by Wang et al. [19], who introduced a method that categorizes unobserved items into three groups according to users’ strong and weak ties with other users. Wang et al. [24] proposed a deep learning-based method that models the review text related to items in order to generate more accurate item embeddings. Yin et al. [27] adopted tensor factorization to jointly model heterogeneous user behaviors, such as add-to-cart, add-to-favorite, clicks and purchases, to eventually predict each behavior type. However, this line of research is not directly related to our proposed method in that ours does not consider any side information.

Moreover, instead of directly predicting the purchased items, several studies focused on understanding the behavior of online users [29,30], and specifically on predicting purchase behaviors [31,32]. As the former line of work, Lo et al. [29] studied user activity and purchasing behaviors that vary over time, especially focusing on user purchasing intent. Cheng et al. [30] extended Lo et al.’s work [29] by generalizing their analysis to characterize the relationship between a user’s intent and his behavior. Loyola et al. [33] recently proposed an encoder–decoder architecture that models users’ click sessions together with the intent of each session. However, their goal was mainly to predict the next click within a click sequence and whether the user ended up purchasing or not, whereas the specific items to be purchased are not modeled. Our goal is different in that we focus on predicting users’ purchases, rather than predicting users’ various intents from their online behaviors. Meanwhile, as the latter line of work, given user demographics and implicit feedback including click records and purchase records, Liu et al. [32] proposed an ensemble method to predict which customers would return to the same merchant within a six-month period. They formulated the problem as a classification task and trained various classification methods. While we similarly use both purchase records and click records, our task is different in that we aim to predict the items that users will purchase rather than to predict repeat buyers. Moreover, Li et al. [31] proposed an MF-based method that predicts the conversion response of users in display advertising, the goal of which inherently differs from our task.


Fig. 3. Hyperparameter analysis on Yoochoose dataset.

4.2. CTR prediction for online advertisements

Online advertisements are a popular means of promotion and product marketing. Indeed, how to predict the click-through rate (CTR) of advertisements to maximize revenue has been an active research area [34,35]. In particular, logistic regression-based models have been widely used for click prediction [36–38]. However, due to its limited model capacity, logistic regression does not suffice to describe the latent features of advertisements or the complicated relationships among them. To this end, click prediction methods based on matrix factorization (MF) [39], tensor factorization (TF) [40] and factorization machines (FM) [41–43] have been proposed. However, while these factorization-based methods capture the pairwise relationships of advertisements, they fail to model the high-order interactions among them. Recently, numerous deep learning-based methods [1,44,45], built on convolutional neural networks (CNNs) or recurrent neural networks (RNNs), have been proposed to address the aforementioned limitations. Although they demonstrated significant improvements over traditional click prediction methods, their task is inherently different from ours: while they focus on predicting clicks on online advertisements, we aim to predict users’ future purchases by leveraging users’ click records.


Fig. 4. Hyperparameter analysis on Taobao dataset.

4.3. Encoder–decoder architecture

Thanks to the recent advances in deep learning, research fields such as natural language processing (NLP), signal processing, and computer vision have seen unprecedented achievements. In particular, the encoder–decoder architecture has shown its effectiveness for the tasks of machine translation [8], image captioning [46] and text summarization [47]. For example, in machine translation, a sentence from the source language is encoded by the encoder, and the corresponding sentence in the target language is generated by the decoder. It has also been shown that the attention mechanism [9] boosts the decoding performance by selectively attending to the source sentence. The pointer network [7] is a variant of the attention-based encoder–decoder architecture in which the output is selected from the elements of the input sequence. Thanks to this property, the pointer network has been widely adopted for text summarization, where, given a long document as the input sequence, the goal of the decoder is to generate a summarized version of it. Text summarization is challenging because words in the input may not appear in the vocabulary, in which case the decoder cannot generate them [48]. In this case, the pointer network can learn to copy the out-of-vocabulary words from the input sequence if they are required for the summary.
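The copying behavior described above can be illustrated with a minimal sketch. It assumes the additive attention scoring commonly used for pointer networks [7], u_j = v · tanh(W1 e_j + W2 d), with randomly initialized toy weights in place of learned ones. The function names, the toy session, and the convex mixture in `mix_with_switch` are illustrative assumptions in the style of pointer-generator models [47]; they are not taken from the equations of PurchaseNet.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Score every input position with additive attention and normalize.

    u_j = v . tanh(W1 e_j + W2 d); the softmax over u is a distribution
    over input positions, so only items present in the input can be chosen.
    """
    projected_decoder = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        projected_encoder = matvec(W1, e_j)
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_encoder, projected_decoder)]
        scores.append(sum(v_k * h_k for v_k, h_k in zip(v, hidden)))
    return softmax(scores)


def mix_with_switch(p_soft, p_ptr, clicked_items, switch):
    """Blend a distribution over all items with the pointer distribution.

    ASSUMPTION: a convex mixture in the style of pointer-generator
    models; not necessarily the switch gate defined for PurchaseNet.
    """
    mixed = [switch * p for p in p_soft]
    for position, item in enumerate(clicked_items):
        mixed[item] += (1.0 - switch) * p_ptr[position]
    return mixed


def random_vector(size):
    """Toy stand-in for learned parameters or hidden states."""
    return [random.uniform(-1.0, 1.0) for _ in range(size)]


random.seed(0)
hidden_size = 4
num_items = 6
clicked_items = [2, 4, 5]  # hypothetical non-duplicate click session (item ids)

encoder_states = [random_vector(hidden_size) for _ in clicked_items]
decoder_state = random_vector(hidden_size)
W1 = [random_vector(hidden_size) for _ in range(hidden_size)]
W2 = [random_vector(hidden_size) for _ in range(hidden_size)]
v = random_vector(hidden_size)

p_ptr = pointer_distribution(encoder_states, decoder_state, W1, W2, v)
p_soft = [1.0 / num_items] * num_items  # placeholder for a learned softmax
p_final = mix_with_switch(p_soft, p_ptr, clicked_items, switch=0.3)

print("pointer distribution over clicked positions:", p_ptr)
print("mixed distribution over all items:", p_final)
```

In this sketch, items that were not clicked receive probability mass only from the global distribution, while clicked items additionally receive the mass that the pointer assigns to their positions. Setting `switch` to exactly 0.0 or 1.0 gives a hard choice between the two distributions, which is one possible reading of a binary switch; that reading is an assumption.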


Fig. 5. Hyperparameter analysis on Xing dataset.

4.4. Traditional recommender system

Triggered by the Netflix competition [49] in 2009, the field of recommender systems has seen rapid progress. Among various recommendation approaches [50], collaborative filtering (CF) [51] has been widely adopted thanks to its superior performance. The core idea of CF is that users who have had similar interests in the past will tend to share similar interests in the future. Traditional CF methods can be divided into two major categories: (1) memory-based CF, and (2) model-based CF. Memory-based CF uses similarity measures such as the Pearson correlation or the cosine similarity between users or items, and recommends items based on the neighbors. Despite its popularity due to its easy implementation and interpretation, it does not scale well to large datasets, and shows poor accuracy when only a small amount of user–item interaction history is available. On the other hand, model-based CF learns a parametric model, such as matrix factorization [3,4,52], to fit the user–item interaction history, which generally results in better performance than memory-based CF. Our proposed method, PurchaseNet, falls into the category of model-based CF. More precisely, PurchaseNet is a CF method because our core assumption is that sessions with similar click sequences will result in similar purchased items. Moreover, PurchaseNet is a model-based method in that we learn an encoder


to encode the click sequences, and a decoder to decode the purchase sequences, instead of simply employing a similarity measure based on item co-occurrence as done by ItemKNN, which is one of our baseline methods (Section 3.2). However, our setting is distinguished from that of traditional CF in that we consider the “sequence” of clicks instead of treating the clicked items as an unordered set. Furthermore, although there exists a line of research on sequence-aware recommender systems [1,53–56], our setting is distinguished in that we consider not only the sequence of the input (i.e., clicks), but also the sequence of the output (i.e., purchases), whereas previous work only considers the sequential dynamics of the input sequence.

5. Conclusion

In this paper, we introduced for the first time a practical yet widely overlooked task of purchase prediction within non-duplicate click history, where each item received only one click. Our proposed method, PurchaseNet, not only focuses on the input click sequence within a session by adopting the pointer network, but also considers the global interactions among items across sessions, thereby taking into account the interactions among items both locally and globally. We also showed that when multiple items are to be predicted, the sequence information of purchases should be considered. Through extensive experiments, we demonstrated that PurchaseNet considerably outperforms several state-of-the-art methods.

In the real-world e-commerce setting, new items are constantly introduced to the system, which requires costly incremental updates of the model [57]. As future work, we plan to address the new item problem by using the copying mechanism of the pointer network, which copies the out-of-set items into the output sequence, rather than incrementally updating the entire model. Moreover, items are often associated with metadata such as description text [23], images [20,21] or reviews [24,58]. As another direction of future work, we plan to incorporate side information related to items to further improve the prediction accuracy. In addition, although we demonstrated the superiority of sequence decoding with a GRU-based decoder over the one-shot decoding scheme, we did not consider the predicted purchase items as an ordered list. This is mainly to make fair comparisons with previous work that considered the output as an unordered list. In this respect, we believe that the next step along this direction is to consider the prediction as an ordered list. Finally, the session data can be preprocessed from a different perspective, in a similar vein to the “masked language model”, which has been shown to be effective in the field of natural language processing [59]. For example, s_1 = ([v_2, v_4, v_5], [v_4]) can be split into s_1^1 = ([v_4], [v_4]), s_1^2 = ([v_2, v_4], [v_4]), s_1^3 = ([v_4, v_5], [v_4]), and s_1^4 = ([v_2, v_4, v_5], [v_4]). Although the training time will increase, preprocessing the session data in this way can potentially lead to improved model performance because the model is trained to predict the correct output with some of the inputs being hidden.

Acknowledgments

This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the Information and Communication Technology (ICT) Consilience Creative program (IITP-2019-20111-00783) and (IITP-2018-0-00584) supervised by the Institute for Information & communications Technology Planning & Evaluation (IITP), and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the MSIT (NRF-2017M3C4A7063570) and (NRF-2016R1E1A1A01942642).
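The session-splitting preprocessing mentioned in the Conclusion can be sketched as follows. The rule is inferred from the single example given there: every sub-session keeps the purchased items and any subset of the remaining clicks, in the original click order. This is an assumption, since the procedure is not specified further, and the function name `split_session` and the string item ids are hypothetical.

```python
from itertools import combinations


def split_session(clicks, purchases):
    """Enumerate sub-sessions that keep every purchased item.

    ASSUMPTION: any subset of the non-purchased clicks may be hidden and
    the click order is preserved; inferred from the example in the text.
    """
    purchased = set(purchases)
    # Positions of clicks that may be hidden (they were not purchased).
    optional_positions = [i for i, item in enumerate(clicks)
                          if item not in purchased]
    sub_sessions = []
    for keep_count in range(len(optional_positions) + 1):
        for kept in combinations(optional_positions, keep_count):
            kept_set = set(kept)
            sub_clicks = [item for i, item in enumerate(clicks)
                          if item in purchased or i in kept_set]
            sub_sessions.append((sub_clicks, list(purchases)))
    return sub_sessions


sub_sessions = split_session(["v2", "v4", "v5"], ["v4"])
for sub_clicks, sub_purchases in sub_sessions:
    print(sub_clicks, sub_purchases)
```

For the example session, this enumeration yields the four sub-sessions listed in the Conclusion. Under this rule the number of sub-sessions doubles with every non-purchased click, which is consistent with the remark that the training time increases.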


References

[1] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, Domonkos Tikk, Session-based recommendations with recurrent neural networks, in: ICLR, 2016.
[2] Yao Wu, Christopher DuBois, Alice X. Zheng, Martin Ester, Collaborative denoising auto-encoders for top-n recommender systems, in: WSDM, ACM, 2016.
[3] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, 2009, pp. 452–461.
[4] Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, Qiang Yang, One-class collaborative filtering, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 502–511.
[5] Yifan Hu, Yehuda Koren, Chris Volinsky, Collaborative filtering for implicit feedback datasets, in: Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, IEEE, 2008, pp. 263–272.
[6] Yehuda Koren, Robert Bell, Chris Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (8) (2009).
[7] Oriol Vinyals, Meire Fortunato, Navdeep Jaitly, Pointer networks, in: NIPS, 2015.
[8] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[9] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[10] Denny Britz, Melody Y. Guan, Minh-Thang Luong, Efficient attention using a fixed-size memory representation. arXiv preprint arXiv:1707.00110, 2017.
[11] Shiv Shankar, Siddhant Garg, Sunita Sarawagi, Surprisingly easy hard-attention for sequence to sequence learning, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 640–645.
[12] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, Adam Lerer, Automatic Differentiation in PyTorch, 2017.
[13] Diederik P. Kingma, Jimmy Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] John Duchi, Elad Hazan, Yoram Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. (2011).
[15] Ronald J. Williams, David Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput. (1989).
[16] Xiangnan He, Hanwang Zhang, Min-Yen Kan, Tat-Seng Chua, Fast matrix factorization for online recommendation with implicit feedback, in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2016, pp. 549–558.
[17] Humphrey Sheil, Omer Rana, Ronan Reilly, Predicting purchasing intent: Automatic feature learning using recurrent neural networks. arXiv preprint arXiv:1807.08207, 2018.
[18] Tong Zhao, Julian McAuley, Irwin King, Leveraging social connections to improve personalized ranking for collaborative filtering, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, ACM, 2014, pp. 261–270.
[19] Xin Wang, Wei Lu, Martin Ester, Can Wang, Chun Chen, Social recommendation with strong and weak ties, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM, 2016, pp. 5–14.
[20] Ruining He, Julian McAuley, VBPR: Visual Bayesian personalized ranking from implicit feedback, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[21] Chanyoung Park, Donghyun Kim, Jinoh Oh, Hwanjo Yu, Do also-viewed products help user rating prediction?, in: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 1113–1122.
[22] Lei Zheng, Vahid Noroozi, Philip S. Yu, Joint deep modeling of users and items using reviews for recommendation, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, ACM, 2017, pp. 425–434.
[23] Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, Hwanjo Yu, Convolutional matrix factorization for document context-aware recommendation, in: Proceedings of the 10th ACM Conference on Recommender Systems, ACM, 2016, pp. 233–240.
[24] Hao Wang, Naiyan Wang, Dit-Yan Yeung, Collaborative deep learning for recommender systems, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 1235–1244.


[25] Chong Wang, David M. Blei, Collaborative topic modeling for recommending scientific articles, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 448–456.
[26] Chao-Yuan Wu, Amr Ahmed, Gowtham Ramani Kumar, Ritendra Datta, Predicting latent structured intents from shopping queries, in: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 1133–1141.
[27] Hongzhi Yin, Hongxu Chen, Xiaoshuai Sun, Hao Wang, Yang Wang, Quoc Viet Hung Nguyen, SPTF: A scalable probabilistic tensor factorization model for semantic-aware behavior prediction, in: 2017 IEEE International Conference on Data Mining (ICDM), IEEE, 2017, pp. 585–594.
[28] Hongxu Chen, Hongzhi Yin, Weiqing Wang, Hao Wang, Quoc Viet Hung Nguyen, Xue Li, PME: Projected metric embedding on heterogeneous networks for link prediction, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2018, pp. 1177–1186.
[29] Caroline Lo, Dan Frankowski, Jure Leskovec, Understanding behaviors that lead to purchasing: A case study of Pinterest, in: SIGKDD, ACM, 2016.
[30] Justin Cheng, Caroline Lo, Jure Leskovec, Predicting intent using activity logs: How goal specificity and temporal range affect user behavior, in: WWW, 2017.
[31] Sheng Li, Jaya Kawale, Yun Fu, Predicting user behavior in display advertising via dynamic collective matrix factorization, in: SIGIR, ACM, 2015.
[32] Guimei Liu, Tam T. Nguyen, Gang Zhao, Wei Zha, Jianbo Yang, Jianneng Cao, Min Wu, Peilin Zhao, Wei Chen, Repeat buyer prediction for e-commerce, in: SIGKDD, ACM, 2016.
[33] Pablo Loyola, Chen Liu, Yu Hirate, Modeling user session and intent with an attention-based encoder-decoder architecture, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, ACM, 2017, pp. 147–151.
[34] Aleksandr Chuklin, Ilya Markov, Maarten de Rijke, Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services.
[35] Zeyuan Allen Zhu, Weizhu Chen, Tom Minka, Chenguang Zhu, Zheng Chen, A novel click model and its applications to online advertising, in: WSDM, ACM, 2010.
[36] Olivier Chapelle, Eren Manavoglu, Romer Rosales, Simple and scalable response prediction for display advertising, TIST (2015).
[37] Matthew Richardson, Ewa Dominowska, Robert Ragno, Predicting clicks: Estimating the click-through rate for new ads, in: WWW, ACM, 2007.
[38] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al., Ad click prediction: A view from the trenches, in: SIGKDD, ACM, 2013.
[39] Aditya Krishna Menon, Krishna-Prasad Chitrapura, Sachin Garg, Deepak Agarwal, Nagaraj Kota, Response prediction using collaborative filtering with hierarchies and side-information, in: SIGKDD, ACM, 2011.
[40] Lili Shan, Lei Lin, Chengjie Sun, Xiaolong Wang, Predicting ad click-through rates via feature-based fully coupled interaction tensor factorization, Electron. Commer. Res. Appl. (2016).
[41] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, Chih-Jen Lin, Field-aware factorization machines for CTR prediction, in: RecSys, ACM, 2016.
[42] Richard J. Oentaryo, Ee-Peng Lim, Jia-Wei Low, David Lo, Michael Finegold, Predicting response in mobile advertising with hierarchical importance-aware factorization machine, in: WSDM, ACM, 2014.
[43] Steffen Rendle, Social network and click-through prediction with factorization machines, in: KDD Cup, 2012.
[44] Qiang Liu, Feng Yu, Shu Wu, Liang Wang, A convolutional click prediction model, in: CIKM, ACM, 2015.
[45] Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, Tie-Yan Liu, Sequential click prediction for sponsored search with recurrent neural networks. arXiv preprint arXiv:1404.5772, 2014.
[46] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Show and tell: A neural image caption generator, in: CVPR, 2015, pp. 3156–3164.
[47] Abigail See, Peter J. Liu, Christopher D. Manning, Get to the point: Summarization with pointer-generator networks, in: ACL, 2017.
[48] Jiatao Gu, Zhengdong Lu, Hang Li, Victor O.K. Li, Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393, 2016.
[49] James Bennett, Stan Lanning, et al., The Netflix Prize, 2007.
[50] Jesús Bobadilla, Fernando Ortega, Antonio Hernando, Abraham Gutiérrez, Recommender systems survey, Knowl.-Based Syst. 46 (2013) 109–132.
[51] Xiaoyuan Su, Taghi M. Khoshgoftaar, A survey of collaborative filtering techniques, Adv. Artif. Intell. 2009 (2009).
[52] Yehuda Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 426–434.
[53] Massimo Quadrana, Paolo Cremonesi, Dietmar Jannach, Sequence-aware recommender systems, ACM Comput. Surv. 51 (4) (2018) 66.
[54] Bartłomiej Twardowski, Modelling contextual information in session-aware recommender systems with neural networks, in: Proceedings of the 10th ACM Conference on Recommender Systems, ACM, 2016, pp. 273–276.
[55] Zhiyong Cheng, Jialie Shen, Lei Zhu, Mohan S. Kankanhalli, Liqiang Nie, Exploiting music play sequence for music recommendation, in: IJCAI, Vol. 17, 2017, pp. 3654–3660.
[56] Ruining He, Julian McAuley, Fusing similarity models with Markov chains for sparse sequential recommendation, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 191–200.
[57] Xin Luo, Yunni Xia, Qingsheng Zhu, Incremental collaborative filtering recommender based on regularized matrix factorization, Knowl.-Based Syst. (2012).
[58] Sungyong Seo, Jing Huang, Hao Yang, Yan Liu, Interpretable convolutional neural networks with dual local and global attention for review rating prediction, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, ACM, 2017, pp. 297–305.
[59] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
