Next-song recommendation with temporal dynamics

Knowledge-Based Systems 88 (2015) 134–143

Contents lists available at ScienceDirect

Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys

Ke Ji a,b, Runyuan Sun a,b,⇑, Wenhao Shu c, Xiang Li d

a School of Information Science and Engineering, University of Jinan, Jinan, China
b Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, China
c School of Information Engineering, East China Jiaotong University, Nanchang, China
d Psychology Department of Renmin University, Beijing, China

Article info

Article history: Received 13 April 2015; received in revised form 2 July 2015; accepted 31 July 2015; available online 6 August 2015

Keywords: Music recommendation; Music playlist; Markov embedding; Temporal dynamics; Sequence prediction

Abstract

Music recommendation has become an important way to reduce users' burden in discovering songs that meet their interest on a large-scale online music site. Compared with general behavior, user listening behavior has a very strong time dependence, in that users frequently change their music interest across sessions, where a "session" is a single user continuously listening to songs over a period of time. However, most existing methods ignore the temporal dynamics of both users and songs across sessions. In this paper, we analyze the temporal characteristics of a real music dataset from Last.fm and propose Time-based Markov Embedding (TME), a next-song recommendation model built on Latent Markov Embedding, which boosts recommendation performance by leveraging temporal information. Specifically, we consider a scenario where user music interest is affected by long-term, short-term and session-term effects. By capturing temporal dynamics in the three effects, our model can track the change of user interest over time. We have conducted experiments on the Last.fm dataset. The results demonstrate that with our time-based model, the recommendation accuracy is significantly improved compared to other state-of-the-art methods.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Music plays an important role in our daily lives and has become one of the most popular forms of entertainment. The rapid development of the Internet brings great convenience to people's musical life, largely freeing them from the high prices and storage limits of traditional records and CDs. In particular, with the growing use of the mobile Internet and its applications, people can visit many online music sites (e.g., Last.fm,1 Pandora2) anytime and anywhere to enjoy music. But as time goes on, the number of songs on these sites increases sharply, which makes it hard for users to find songs that meet their interest among massive collections [1]. A music recommendation system [2] that automatically generates the most interesting songs for each user is therefore particularly important for online music providers to improve the quality of their services and promote the development of music.

Music recommendation is a playlist prediction problem that generates a list of songs related to the given songs of a session. Because of the great commercial demand, there have been many studies on music recommendation [3], which can be classified into three categories: content-based methods [4–6], context-based methods [7–9] and collaborative filtering [10–12]. Furthermore, some hybrid approaches [13–20] have been proposed to merge these methods and overcome the shortcomings of a single model. Most of these methods treat playlist prediction as an information retrieval problem and ignore the sequential nature of playlists.

Sequence prediction is a research area of music recommendation that has emerged in recent years. It reduces playlist prediction to a sequence modeling setting that is more concerned with the order of songs and the transitions between two adjacent songs in the playlists. Since this is similar to the Markov chain approach in natural language processing, some Markov chain-based approaches [21–23] have been used to solve the sequence prediction problem. Recent work has studied Markov embedding in a multi-dimensional latent space [24,25]. Experiments show that Markov embedding models substantially outperform the traditional n-gram Markov model.

Although these methods address playlist prediction to a large extent, they ignore temporal information. Compared with the general recommendation scenario, strong time dependence is a distinctive property of music recommendation. For a song, popularity can affect user listening behavior. For instance, the song "My Heart Will Go On" can earn many more listeners on the 100th anniversary of the sinking of the Titanic than on normal days. For a user, interest goes through concept drift [26,27]. For instance, it is common for users' music interest to rise and decay rather than stay fixed. For a session, various reasons lead to its formation, not just user interest. For instance, some songs in a session do not match the user's previous interest, but the user still listens to them out of curiosity or under the influence of other activities. These temporal changes bring unique challenges to music recommendation. We therefore require a dynamic model that can track the changes of user interest and adjust the prediction model dynamically, so as to provide real-time music recommendation.

To address the above challenges, we propose Time-based Markov Embedding (TME), a dynamic next-song recommendation model built on Latent Markov Embedding. In addition to the embedded users and songs [25], we also embed sessions into a Euclidean space. Specifically, our embedding considers three aspects: the distance between users and songs (long-term effect) reflects user global interest; the distance between songs and songs (short-term effect) reflects sudden song transitions that have low impact on the long term; the distance between sessions and songs (session-term effect) reflects user local interest in each session. By modeling temporal dynamics in the three effects, we extend them into time changing effects. Our model is designed to integrate the time changing effects and is able to track the change of user interest over time. We have conducted experiments on a real music dataset from Last.fm. The empirical results and analysis demonstrate that our model outperforms other state-of-the-art Markov chain-based methods.

The rest of this paper is organized as follows. We first give an overview of related work in Section 2. Then, we present the preliminaries, including data temporal analysis and problem setting, in Section 3. We describe our model in detail in Section 4. We report the experimental results and analysis in Section 5. Finally, we provide a conclusion in Section 6.

⇑ Corresponding author at: School of Information Science and Engineering, University of Jinan, Jinan, China. E-mail address: [email protected] (R. Sun).
1 http://www.last.fm.
2 http://www.pandora.com.
http://dx.doi.org/10.1016/j.knosys.2015.07.039

2. Related work

Music recommendation has been a fast-growing research field of recommender systems. There are several traditional approaches to music recommendation [3].

• Content-based methods [4,5] extract features directly from the audio signal, compute the similarity among songs based on these features, and recommend songs with high similarity to users. An overview of content-based methods is given in [6].
• Context-based methods [7–9] take context information (e.g., location and emotion) into account. Context information is crucial in music systems because people prefer different songs for different activities such as working, sleeping and studying.
• Collaborative filtering methods [12] use explicit or implicit feedback to track user listening habits and need no additional information. They can be grouped into two general classes: memory-based [28] and model-based [29].

Content-based methods need audio content analysis, which increases computation cost, and cannot achieve rapid prediction in large-scale systems. Context-based methods are inapplicable when context information is not available. Collaborative filtering methods suffer from problems such as cold start and sparsity. Practice has shown [13,9] that collaborative filtering outperforms content-based and context-based methods on rating data. It is worth mentioning that collaborative filtering methods have been widely used because they are simple, practical, easy to implement and accurate. Some hybrid approaches [13–20] have been proposed to avoid certain limitations of a single method. In general, these methods treat music recommendation as a similarity retrieval problem.

However, unlike in traditional recommender systems, user sessions contain many unique historical playlists. Every playlist is a sequence of songs whose order reflects the user's sequential listening behavior. Recently, sequence prediction has become one of the research emphases in music recommendation. In contrast to previous methods, sequence prediction can be seen as a language modeling setting, and Markov chains have been used to solve the problem. For example, the smoothed n-gram model [21] is the most common method for machine translation and automatic spelling correction. In particular, Bigram is a first-order Markov model where transition probabilities are estimated directly for every pair of songs. McFee et al. [22] propose a playlist generation algorithm which characterizes playlists as generative models of strings of songs belonging to some unknown language. Chen et al. [24] proposed LME, a new family of learning algorithms which formulate playlist prediction as a regularized maximum-likelihood embedding of Markov chains in Euclidean space. Wu et al. [25] proposed PME, a personalized next-song recommendation algorithm that extends LME by embedding users into the Euclidean space. In the extensive tests of [24,25], the embedding-based models LME and PME outperform the statistics-based n-gram models.

Often, user interest evolves as time goes on. Activities that happen to users, songs or sessions may lead to a temporary change of user interest. But these methods stay fixed after model training and cannot be continuously updated to reflect the user's present interest. The goal of our work is to track the changes of user interest by leveraging temporal dynamics.

3. Preliminaries

3.1. Data temporal analysis

In this paper, the time drifting data we use is the listening log of Last.fm; Table 2 shows the basic statistics of the dataset. Based on the temporal information in the dataset, we analyze temporal dynamics in user, song and session.

User. Fig. 1(a) shows the statistics of a selected user's changes in listening behavior since his first listening event. The songs that the user has listened to are randomly divided into four groups. We set a coarse 1-week time resolution, which is sufficient for the rather slowly changing listening behavior in practice. For each group of songs, we count the number of times the user listened to them. The four curves, sometimes smooth and sometimes bumpy, show that the user pays different attention to the groups over time. This indicates that user global interest changes over time.

Song. Fig. 1(b) shows the statistics of a selected song: the number of times the song was listened to and the number of its listeners since its first listening event. We set a coarse 2-week time resolution, which is sufficient for the rather slowly changing song popularity in practice. We find that the number of plays first increases fast, then slowly declines, and then suddenly increases again, a pattern that recurs for other songs. The number of listeners has a similar trend but is lower than the number of plays. This indicates that song popularity can reflect users' current behavioral tendency.

Fig. 1. Temporal dynamics in user, song and session.

Session. We run the PME model on the Last.fm data. A visualization of the trained PME in $\mathbb{R}^2$ with a selected user and songs is shown in Fig. 1(c), where the user is represented by a red circle and the songs of three randomly picked sessions of that user are represented by circles with different colors. It shows that the embedding positions of all songs in the three sessions are near the user. Looking deeper, we find that songs in different sessions are quite distant from one another, while songs in the same session cluster tightly. This indicates that the user has a local interest in each session. Besides, some outliers appear within the same session, which can be seen as a change of local interest.
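The per-week and per-two-week counts described for the user and song analyses can be sketched as a simple bucketing of listening timestamps. This is a minimal illustration, not the authors' analysis code: the function name `plays_per_bucket` and the input format (a list of `datetime` objects for one user or one song) are assumptions made here.

```python
from collections import Counter
from datetime import datetime, timedelta


def plays_per_bucket(timestamps, bucket_days=7):
    """Count listening events per fixed-width time bucket.

    Buckets are measured from the first listening event, matching the
    "since the first listening event" convention of the analysis.
    bucket_days=7 gives the 1-week resolution, bucket_days=14 the 2-week one.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    width = timedelta(days=bucket_days)
    # timedelta // timedelta yields the integer bucket index.
    counts = Counter((ts - start) // width for ts in timestamps)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


# Hypothetical example: four plays spread over one month.
t0 = datetime(2009, 1, 1)
example = [t0, t0 + timedelta(days=1), t0 + timedelta(days=8), t0 + timedelta(days=30)]
weekly_counts = plays_per_bucket(example, bucket_days=7)
```

Plotting one such list per song group (or per song) over the bucket index would give curves of the kind the text describes for Fig. 1(a) and (b).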

3.2. Problem setting

Given a user-generated playlist in a session, our task is to recommend a ranked list of songs which the current user may go on to listen to next. We use $U = \{u_1, u_2, \ldots, u_m\}$ to denote the set of users and $S = \{s_1, s_2, \ldots, s_n\}$ to denote the set of songs. Let $p_{u,c} = \left(p_{u,c}^{(1)}, p_{u,c}^{(2)}, \ldots, p_{u,c}^{(|p_{u,c}|)}\right)$ be the history playlist of user $u$'s $c$-th session, where $p_{u,c}^{(k)}$ is the $k$-th song. All sessions from user $u$ form the set $p_u = \{p_{u,1}, p_{u,2}, \ldots, p_{u,|p_u|}\}$. Then, the entire playlist database can be represented as $D = \{(u, p_u) \mid u \in U\}$. $D_{p_{u,c},s_a,s_b}$, marked by a timestamp $t$, denotes one instance of two adjacent songs $s_a$ and $s_b$ in the playlist $p_{u,c}$ of user $u$'s $c$-th session. The major symbols we use throughout the paper are summarized in Table 1.

Table 1. List of symbols.

| Symbol | Description |
|---|---|
| $U$, $S$ | User set, song set |
| $D$ | The entire playlist database |
| $d$ | The dimension of the Euclidean space |
| $u$, $m$ | A user $u \in U$, the number of users in $U$ |
| $s$, $n$ | A song $s \in S$, the number of songs in $S$ |
| $p_u$ | The set of the playlists in all of user $u$'s sessions |
| $p_{u,c}$ | The playlist in user $u$'s $c$-th session |
| $p_{u,c}^{(i)}$ | The $i$-th song of the playlist $p_{u,c}$ |
| $D_{p_{u,c},s_a,s_b}$ | Two adjacent songs $s_a$ and $s_b$ in the playlist $p_{u,c}$ |
| $\mathbb{R}^d$ | $d$-dimensional Euclidean space |
| $y(u)$ | User $u$'s vector in $\mathbb{R}^d$ |
| $x(s)$ | Song $s$'s vector in $\mathbb{R}^d$ |
| $z(p_{u,c})$ | Session $p_{u,c}$'s vector in $\mathbb{R}^d$ |
| $Y_T^u$ | User $u$'s vectors in all time slices |
| $y^i(u)$ | User $u$'s vector in the $i$-th time slice |
| $T_U$ | The length of the time slice |
| $T_B$ | The length of the period over which popularity is measured |
| $Num_s(t)$ | The number of times $s$ was listened to by users before time $t$ |
| $B_s(t)$ | Song $s$'s popularity in a period of length $T_B$ before time $t$ |
| $W_{u,c}$ | The weight vector over the songs of $p_{u,c}$ |
| $w_{u,c}^{(i)}$ | The weight value for the $i$-th song of $p_{u,c}$ |
| $T_{u,c}(i)$ | The time spent by user $u$ on the $i$-th song of $p_{u,c}$ |
| $\lambda_U, \lambda_B, \lambda_S, \lambda_X, \lambda_Y, \lambda_Z, \lambda_T$ | The regularization parameters |
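To make the notation concrete, the playlist database and its adjacent-song instances can be sketched as below. The dictionary layout and the name `adjacent_transitions` are illustrative assumptions, not part of the paper, and the timestamp attached to each instance is omitted for brevity.

```python
def adjacent_transitions(database):
    """Yield (u, c, s_a, s_b) for every pair of adjacent songs.

    `database` maps each user u to the list of that user's session
    playlists p_{u,1}, p_{u,2}, ...; each playlist is a list of song ids.
    Sessions are numbered from 1, following the c-th session convention.
    """
    for user, sessions in database.items():
        for c, playlist in enumerate(sessions, start=1):
            # zip pairs the (j-1)-th song with the j-th song.
            for s_a, s_b in zip(playlist, playlist[1:]):
                yield (user, c, s_a, s_b)


# Hypothetical toy database with one user and two sessions.
toy_db = {"u1": [["a", "b", "c"], ["d"]]}
toy_transitions = list(adjacent_transitions(toy_db))
```

A session with a single song contributes no transition, since it contains no adjacent pair.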

4. Next-song recommendation model

We first introduce Latent Markov Embedding in Section 4.1. After that, we describe how our model measures users' long-term, short-term and session-term effects in Section 4.2. Then, we discuss how to incorporate temporal dynamics in user, song and session into the three effects in Section 4.3. Finally, in Section 4.4 we present our whole Time-based Markov Embedding model, which leverages the time changing effects in an integrated way. A summary of the recommendation task in our model can be found in Fig. 2.

4.1. Latent Markov embedding

Modeling the sequential listening behavior is a central problem in music recommendation. A popular approach to sequence prediction is to model playlists as a Markov chain, where the probability of a playlist $p_{u,c} = \left(p_{u,c}^{(1)}, p_{u,c}^{(2)}, \ldots, p_{u,c}^{(|p_{u,c}|)}\right)$ containing $|p_{u,c}|$ songs is represented as the product of the transition probabilities $\Pr\left(p_{u,c}^{(j)} \mid p_{u,c}^{(j-1)}\right)$ between two adjacent songs $p_{u,c}^{(j-1)}$ and $p_{u,c}^{(j)}$:

$$\Pr(p_{u,c}) = \prod_{j=1}^{|p_{u,c}|} \Pr\left(p_{u,c}^{(j)} \mid p_{u,c}^{(j-1)}\right) \tag{1}$$

LME [24] treats the playlists as a Markov chain in latent space, where all songs are mapped into a $d$-dimensional Euclidean space $\mathbb{R}^d$ and the vector $x(s)$ is the representation of song $s$ in this space. Given two songs $s_a$ and $s_b$, the Euclidean distance $\|x(s_a) - x(s_b)\|_2$ reflects the strength of their relationship: if they are close, a user may well listen to $s_b$ next after listening to $s_a$. The transition probability $\Pr(s_b \mid s_a)$ from $s_a$ to $s_b$ is then proportional to $e^{-\|x(s_a) - x(s_b)\|_2^2}$. However, this basic model ignores the personality of user interest and cannot generate personalized recommendations for different users. Wu et al. [25] propose Personalized Markov Embedding (PME), extending [24] by embedding users into the space, i.e., using a vector $y(u)$ to denote user $u$. The transition probability $\Pr(s_b \mid s_a)$ from $s_a$ to $s_b$ made by user $u$ is redefined as $\Pr(s_b \mid s_a, u)$, which is proportional to $e^{-\|y(u) - x(s_b)\|_2^2}$ and $e^{-\|x(s_a) - x(s_b)\|_2^2}$. Section 4.2 gives a detailed description of this formula.
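The distance-based transition probability described in this subsection can be sketched as a softmax over negative squared distances. This is a minimal illustration under stated assumptions: the names `sq_dist` and `transition_probs` are hypothetical, the embeddings are plain tuples rather than learned vectors, and when a user vector is passed the two exponentials are multiplied and normalized jointly, which is a simplification chosen here for brevity.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def transition_probs(song_vecs, s_a, user_vec=None):
    """Return Pr(s_b | s_a) for every song s_b, normalized over all songs.

    Without user_vec the score of s_b is -||x(s_a) - x(s_b)||^2 (LME style).
    With user_vec the term -||y(u) - x(s_b)||^2 is added (PME style).
    """
    scores = {}
    for s_b, x_b in song_vecs.items():
        score = -sq_dist(song_vecs[s_a], x_b)
        if user_vec is not None:
            score -= sq_dist(user_vec, x_b)
        scores[s_b] = score
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    weights = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Hypothetical 2-D embeddings of three songs.
toy_songs = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (3.0, 0.0)}
probs = transition_probs(toy_songs, "a")
probs_user = transition_probs(toy_songs, "a", user_vec=(3.0, 0.0))
```

In the toy example, song "b" is closer to "a" than "c" is, so it receives the larger probability; placing the user vector at the position of "c" raises the probability of "c" relative to "b".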

Fig. 2. Illustration of our model. Our goal is to combine the long-term and short-term effects as well as session-term effect for next-song recommendation. Specifically, our model can track the changes of user interest over time by incorporating temporal dynamics in user, song and session into the three effects respectively.

Based on the existing playlists, the embedding of songs and users can be trained through a maximum likelihood approach (Eq. (2)).

$$(X, Y) = \arg\max_{X,Y} \prod_{(u,p_u) \in D} \prod_{i=1}^{|p_u|} \prod_{j=1}^{|p_{u,i}|} \Pr\left(p_{u,i}^{(j)} \mid p_{u,i}^{(j-1)}, u\right) \tag{2}$$

The optimization problem can be solved using a stochastic gradient method.

4.2. Modeling long-term, short-term and session-term effects

Often, user listening behavior is determined by multiple factors. Wu et al. [25] presented a way to combine the long-term and short-term effects [30], where the long-term effect reflects user global interest and the short-term effect reflects the sequential nature of listening. For the convenience of modeling the two effects, it decomposes the transition $\Pr(s_b \mid s_a, u)$ into $\Pr(s_b \mid u)$ and $\Pr(s_b \mid s_a)$, which are viewed as user $u$'s long-term and short-term effects respectively.

$$\Pr(s_b \mid u) = \frac{e^{-\|y(u) - x(s_b)\|_2^2}}{\sum_{j=1}^{|S|} e^{-\|y(u) - x(s_j)\|_2^2}} \tag{3}$$

$$\Pr(s_b \mid s_a) = \frac{e^{-\|x(s_a) - x(s_b)\|_2^2}}{\sum_{j=1}^{|S|} e^{-\|x(s_a) - x(s_j)\|_2^2}} \tag{4}$$

where $\sum_{j=1}^{|S|} e^{-\|y(u) - x(s_j)\|_2^2}$ and $\sum_{j=1}^{|S|} e^{-\|x(s_a) - x(s_j)\|_2^2}$ are normalization factors that ensure $\sum_{b=1}^{|S|} \Pr(s_b \mid u) = 1$ and $\sum_{b=1}^{|S|} \Pr(s_b \mid s_a) = 1$.

As stated in Section 3.1, every session presents the user's local interest. To incorporate the session interest, we map sessions into points in the space $\mathbb{R}^d$ and represent each session $p_{u,c}$ of user $u$ as a vector $z(p_{u,c})$. The Euclidean distance $\|z(p_{u,c}) - x(s_b)\|_2$ between session $p_{u,c}$ and song $s_b$ reflects how well song $s_b$ matches the local interest of session $p_{u,c}$. Analogously, we define the transition probability $\Pr(s_b \mid z(p_{u,c}))$.

$$\Pr\left(s_b \mid z(p_{u,c})\right) = \frac{e^{-\|z(p_{u,c}) - x(s_b)\|_2^2}}{\sum_{j=1}^{|S|} e^{-\|z(p_{u,c}) - x(s_j)\|_2^2}} \tag{5}$$

Then, by combining the above three effects, the whole transition model of two adjacent songs $s_a$ and $s_b$ in the playlist $p_{u,c}$ of user $u$'s $c$-th session is

$$\Pr\left(s_b \mid p_{u,c}, s_a, u\right) = \Pr(s_b \mid u)\,\Pr(s_b \mid s_a)\,\Pr\left(s_b \mid z(p_{u,c})\right) \tag{6}$$

Thus, the session-term effect ignored by LME and PME is incorporated by our model.

4.3. Integration with temporal dynamics

In the previous subsection, we described our basic model for combining long-term, short-term and session-term effects. However, once training is complete, the model is fixed and cannot be continuously updated to reflect the user's present interest. We now introduce our approaches to modeling temporal dynamics in the three effects.

Time changing long-term effect. As shown in Fig. 1(a), user global interest changes gradually. We divide the entire time period into time slices $\{[0, T_U], [T_U, 2T_U], \ldots, [nT_U, (n+1)T_U]\}$ of equal length $T_U$, and every slice is assigned an embedding, giving $Y_T^u = \{y^1(u), y^2(u), \ldots, y^n(u)\}$ for user $u$, where $y^t(u)$ is the element of $Y_T^u$ corresponding to the slice that time $t$ belongs to. $y(u)$ indicates the global bias of user interest from the average, and $y^t(u)$ indicates the gradual drifting of the bias, extending Eq. (3) to the following transition:

$$\Pr(s_b \mid u, t) = \frac{e^{-\|y(u) - x(s_b)\|_2^2 - \lambda_U \|y^t(u) - x(s_b)\|_2^2}}{\sum_{j=1}^{|S|} e^{-\|y(u) - x(s_j)\|_2^2 - \lambda_U \|y^t(u) - x(s_j)\|_2^2}} \tag{7}$$

where $\lambda_U$ controls the influence of global interest drifting on the user's long-term effect. Note that Eq. (7) can be dynamically updated with the change of $y^t(u)$ over time.

Time changing short-term effect. As shown in Fig. 1(b), listening behavior seems to be influenced by songs' popularity. That is, despite having little interest in a song, a user may still listen to it under the effect of its popularity. So the short-term transition $\Pr(s_b \mid s_a)$ should change dynamically. Given a song $s$ and the current time $t$, measured since $s$ was first listened to, $Num_s(t)$ denotes the number of times $s$ was listened to by users before time $t$. Then, in a period of length $T_B$ (e.g., a number of days) before time $t$, $s$'s popularity is represented as:

$$B_s(t) = Num_s(t) - Num_s(t - T_B) \tag{8}$$
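The windowed popularity of Eq. (8) can be sketched with a binary search over a song's sorted play times. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, times are plain numbers (e.g., days since the first play), and "before time t" is assumed here to mean strictly before.

```python
from bisect import bisect_left


def num_listens_before(play_times, t):
    """Num_s(t): number of plays strictly before time t.

    `play_times` must be sorted in ascending order.
    """
    return bisect_left(play_times, t)


def popularity(play_times, t, window):
    """B_s(t) = Num_s(t) - Num_s(t - T_B), with window playing the role of T_B."""
    return num_listens_before(play_times, t) - num_listens_before(play_times, t - window)


# Hypothetical play times (in days) of one song.
toy_plays = [1, 2, 5, 8, 9]
recent_popularity = popularity(toy_plays, t=10, window=3)
```

With these toy values, only the plays at days 8 and 9 fall in the three days before day 10, so the windowed count is 2.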

We introduce songs' popularity as an additive bias component to model the interest drifting in the short-term effect, which gives the definition of $\Pr(s_b \mid s_a, t)$, extending Eq. (4) to the following transition:

$$\Pr(s_b \mid s_a, t) = \frac{e^{-\|x(s_a) - x(s_b)\|_2^2 + \lambda_B B_{s_b}(t)}}{\sum_{j=1}^{|S|} e^{-\|x(s_a) - x(s_j)\|_2^2 + \lambda_B B_{s_j}(t)}} \tag{9}$$

where $\lambda_B$ controls the influence of songs' popularity on the user's short-term effect. Note that Eq. (9) can be dynamically updated with the change of $B_s(t)$ over time.

Time changing session-term effect. Given a session $p_{u,c} = \left(p_{u,c}^{(1)}, p_{u,c}^{(2)}, \ldots, p_{u,c}^{(n')}\right)$ of user $u$, $z(p_{u,c})$ reflects the user's local interest in this session. However, as shown in Fig. 1(c), not every song of $p_{u,c}$ reflects the user's present interest. For instance, a user may click some songs by chance but not spend much time listening to them. To distinguish the degree of interest in the songs, we define a weight vector $W_{u,c} = \left(w_{u,c}^{(1)}, w_{u,c}^{(2)}, \ldots, w_{u,c}^{(n')}\right)$ over the songs of $p_{u,c}$ with the normalization constraint $\sum_{i \in p_{u,c}} w_{u,c}^{(i)} = 1$. The idea is that a song fully listened to by the user better matches the user's present interest than a song only partially listened to. We then assign the weight value $w_{u,c}^{(k)}$ to be proportional to the listening time spent by the user on the song as follows:

$$w_{u,c}^{(k)} = \frac{T_{u,c}(k)}{\sum_{i \in p_{u,c}} T_{u,c}(i)} \tag{10}$$

where $T_{u,c}(k)$ is the time spent by user $u$ on the song $p_{u,c}^{(k)}$. We introduce a new vector called $y'(p_{u,c})$:

$$y'(p_{u,c}) = \sum_{i \in p_{u,c}} w_{u,c}^{(i)}\, x\left(p_{u,c}^{(i)}\right) \tag{11}$$

where $z(p_{u,c})$ indicates the session interest bias from the average, and $y'(p_{u,c})$ indicates session interest drifting on the bias, extending Eq. (5) to the following transition:

$$\Pr\left(s_b \mid p_{u,c}, t\right) = \frac{e^{-\|z(p_{u,c}) - x(s_b)\|_2^2 - \lambda_S \|y'(p_{u,c}) - x(s_b)\|_2^2}}{\sum_{j=1}^{|S|} e^{-\|z(p_{u,c}) - x(s_j)\|_2^2 - \lambda_S \|y'(p_{u,c}) - x(s_j)\|_2^2}} \tag{12}$$

where $\lambda_S$ controls the influence of session interest drifting on the session-term effect. Note that Eq. (12) can be dynamically updated with the change of $y'(p_{u,c})$ when new songs are added to $p_{u,c}$ over time.

4.4. An integrated model and optimization

After incorporating temporal dynamics, a maximum likelihood approach is used to learn the embedding vectors $X = [x(s_1), \ldots, x(s_n)]$, $Y = [y(u_1), \ldots, y(u_m)]$, $Z = [z(p_{1,1}), \ldots, z(p_{m,n'})]$ and $Y_T^u = \{y^1(u), y^2(u), \ldots, y^n(u)\}$. The existing playlists in $D$ lead to the following training problem:

$$\begin{aligned}
\left(X, Y, Z, Y_T^u\right) &= \arg\max_{X,Y,Z,Y_T^u} \prod_{D_{p_{u,c},s_a,s_b} \in D} \Pr\left(D_{p_{u,c},s_a,s_b}\right) \\
&= \arg\max_{X,Y,Z,Y_T^u} \prod_{D_{p_{u,c},s_a,s_b} \in D} \Pr\left(s_b \mid p_{u,c}, s_a, u\right) \\
&= \arg\max_{X,Y,Z,Y_T^u} \prod_{D_{p_{u,c},s_a,s_b} \in D} \Pr(s_b \mid u, t)\,\Pr(s_b \mid s_a, t)\,\Pr\left(s_b \mid p_{u,c}, t\right)
\end{aligned} \tag{13}$$

Eq. (13) is equivalent to maximizing the log-likelihood (Eq. (14)):

$$\begin{aligned}
L\left(D \mid X, Y, Z, Y_T^u\right) &= \sum_{D_{p_{u,c},s_a,s_b}} \left[ \ln \Pr(s_b \mid u, t) + \ln \Pr(s_b \mid s_a, t) + \ln \Pr\left(s_b \mid p_{u,c}, t\right) \right] \\
&= \sum_{D_{p_{u,c},s_a,s_b}} \Bigg[ -D(u, s_b)^2 - \lambda_U D(u^t, s_b)^2 - \ln\Bigg(\sum_{j \in S} e^{-D(u, s_j)^2 - \lambda_U D(u^t, s_j)^2}\Bigg) \\
&\qquad - D(s_a, s_b)^2 + \lambda_B B_{s_b}(t) - \ln\Bigg(\sum_{j \in S} e^{-D(s_a, s_j)^2 + \lambda_B B_{s_j}(t)}\Bigg) \\
&\qquad - D(p_{u,c}, s_b)^2 - \lambda_S D(u', s_b)^2 - \ln\Bigg(\sum_{j \in S} e^{-D(p_{u,c}, s_j)^2 - \lambda_S D(u', s_j)^2}\Bigg) \Bigg]
\end{aligned} \tag{14}$$

where $D(s_a, s_b)$ denotes the distance $\|x(s_a) - x(s_b)\|_2$, $D(u, s_b)$ denotes the distance $\|y(u) - x(s_b)\|_2$, $D(u^t, s_b)$ denotes the distance $\|y^t(u) - x(s_b)\|_2$, $D(p_{u,c}, s_b)$ denotes the distance $\|z(p_{u,c}) - x(s_b)\|_2$ and $D(u', s_b)$ denotes the distance $\|y'(p_{u,c}) - x(s_b)\|_2$. In order to avoid overfitting, we add Frobenius-norm ($\|\cdot\|_{Frob}^2$) regularization terms on the embedding vectors, and have

$$\begin{aligned}
\left(X, Y, Z, Y_T^u\right) &= \arg\max_{X,Y,Z,Y_T^u} L\left(D \mid X, Y, Z, Y_T^u\right) - \lambda_X \|X\|_{Frob}^2 - \lambda_Y \|Y\|_{Frob}^2 - \lambda_Z \|Z\|_{Frob}^2 - \lambda_T \sum_{u \in U} \left\|Y_T^u\right\|_{Frob}^2 \\
&= \arg\max_{X,Y,Z,Y_T^u} L'\left(D \mid X, Y, Z, Y_T^u\right)
\end{aligned} \tag{15}$$
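The contribution of a single transition to the log-likelihood of Eq. (14) can be sketched as the sum of three log-softmax terms, one per effect. This is a minimal sketch under stated assumptions: all function and parameter names are hypothetical, vectors are plain tuples, the popularity values $B_s(t)$ are passed in as a precomputed dictionary, and the regularization terms of Eq. (15) are left out.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def log_softmax_score(scores, target):
    """ln(e^{scores[target]} / sum_j e^{scores[j]}), computed stably."""
    top = max(scores.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in scores.values()))
    return scores[target] - log_norm


def local_log_likelihood(x, s_a, s_b, y_u, y_ut, z_p, y_p, pop,
                         lam_U, lam_B, lam_S):
    """Log-likelihood term of one transition s_a -> s_b.

    x: song id -> song vector; y_u, y_ut: global and time-slice user vectors;
    z_p, y_p: session vector and listening-time weighted session vector;
    pop: song id -> popularity B_s(t) at the time of the transition.
    """
    long_term = {s: -sq_dist(y_u, v) - lam_U * sq_dist(y_ut, v) for s, v in x.items()}
    short_term = {s: -sq_dist(x[s_a], v) + lam_B * pop[s] for s, v in x.items()}
    session = {s: -sq_dist(z_p, v) - lam_S * sq_dist(y_p, v) for s, v in x.items()}
    return (log_softmax_score(long_term, s_b)
            + log_softmax_score(short_term, s_b)
            + log_softmax_score(session, s_b))


# Hypothetical toy embeddings: user and session vectors sit on song "b".
toy_x = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)}
center = (1.0, 0.0)
no_pop = {"a": 0.0, "b": 0.0, "c": 0.0}
ll_b = local_log_likelihood(toy_x, "a", "b", center, center, center, center,
                            no_pop, lam_U=0.1, lam_B=0.1, lam_S=0.1)
ll_c = local_log_likelihood(toy_x, "a", "c", center, center, center, center,
                            no_pop, lam_U=0.1, lam_B=0.1, lam_S=0.1)
```

Summing this quantity over all transitions in the training data would give the unregularized objective; in the toy example the nearby song "b" scores higher than the distant song "c".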

Now, keeping the parameters ($\lambda_X$, $\lambda_Y$, $\lambda_Z$ and $\lambda_T$) fixed, we use the stochastic gradient method to solve the optimization problem of maximizing the log-likelihood with regularization terms (Eq. (15)). The method iteratively applies the following update rules, based on the derivatives, for all the transitions $D_{p_{u,c},s_a,s_b}$ in $D$:

$$\begin{aligned}
x(s_a) &= x(s_a) + \tau \frac{\partial l\left(D_{p_{u,c},s_a,s_b}\right)}{\partial x(s_a)} \\
x(s_b) &= x(s_b) + \tau \frac{\partial l\left(D_{p_{u,c},s_a,s_b}\right)}{\partial x(s_b)} \\
y(u) &= y(u) + \tau \frac{\partial l\left(D_{p_{u,c},s_a,s_b}\right)}{\partial y(u)} \\
z(p_{u,c}) &= z(p_{u,c}) + \tau \frac{\partial l\left(D_{p_{u,c},s_a,s_b}\right)}{\partial z(p_{u,c})} \\
y^t(u) &= y^t(u) + \tau \frac{\partial l\left(D_{p_{u,c},s_a,s_b}\right)}{\partial y^t(u)}
\end{aligned}$$

where $l\left(D_{p_{u,c},s_a,s_b}\right)$ is one "local" log-likelihood term of $L'\left(D \mid X, Y, Z, Y_T^u\right)$ that is concerned with the transition from $s_a$ to $s_b$ in the playlist $p_{u,c}$ of user $u$'s $c$-th session, and $\tau$ is the learning rate. The derivatives of the term are shown in Appendix A.

5. Experiments

5.1. Dataset

In this paper, we evaluate our model on a real music dataset, lastfm-dataset-1K.3 Last.fm4 is a music website, founded in the United Kingdom in 2002, that provides a music recommendation service. Users can use Last.fm by signing up and downloading the Scrobbler, which helps users discover more music based on the songs they have listened to. This dataset represents the whole listening habits (until 5 May 2009) of 992 users. But the original data is very large and sparse. To reduce noise, we only select the songs in the playlists which have been listened to by at least 100 users. We segment every user's listening records into sessions. For the session segmentation, we ensure that the user's inactive intervals between adjacent songs within a session are less than 1 h; otherwise we split the records between the two adjacent songs. We put the last transition of songs from every session into the test set and the rest into the training set, and ensure that every test song exists in the training set. After optimizing the model on the training set, we recommend songs to every session and evaluate the recommendation accuracy on the test set. The statistics of the final dataset are summarized in Table 2.

3 http://ocelma.net/MusicRecommendationDataset/index.html.
4 http://www.last.fm/.
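The session segmentation and the train/test split described for the dataset can be sketched as follows. This is an illustrative sketch, not the authors' preprocessing code: the function names and the `(timestamp, song)` input format are assumptions, and the checks that every song has at least 100 listeners and that every test song also appears in the training set are omitted.

```python
from datetime import datetime, timedelta


def segment_sessions(events, max_gap=timedelta(hours=1)):
    """Split one user's listening log into sessions.

    `events` is a list of (timestamp, song) pairs sorted by timestamp.
    A new session starts whenever the gap between two adjacent songs
    is not less than max_gap (1 h in the paper's setting).
    """
    sessions = []
    current = []
    last_ts = None
    for ts, song in events:
        if last_ts is not None and ts - last_ts >= max_gap:
            sessions.append(current)
            current = []
        current.append(song)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


def split_last_transition(session):
    """Hold out the last transition of a session for testing.

    Returns (training_songs, test_transition); sessions with fewer than
    two songs have no transition to hold out.
    """
    if len(session) < 2:
        return session, None
    return session[:-1], (session[-2], session[-1])


# Hypothetical listening log of one user.
t0 = datetime(2009, 1, 1, 12, 0)
log = [
    (t0, "a"),
    (t0 + timedelta(minutes=10), "b"),
    (t0 + timedelta(hours=2), "c"),
    (t0 + timedelta(hours=2, minutes=5), "d"),
    (t0 + timedelta(hours=2, minutes=10), "e"),
]
toy_sessions = segment_sessions(log)
train_songs, test_transition = split_last_transition(toy_sessions[1])
```

In the toy log, the gap of nearly two hours between "b" and "c" starts a second session, and the transition from "d" to "e" is held out from it.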

Table 2. Statistics of the dataset.

| #Users | #Songs | #Sessions | #Training transitions | #Test transitions |
|---|---|---|---|---|
| 992 | 12621 | 1260106 | 10869336 | 119523 |

5.2. Metrics

We use four popular ranking metrics, Recall, Precision, F1-score and MAP, to measure the prediction performance of our proposed model in comparison with other state-of-the-art methods. Recall is defined as the ratio of listened songs $N_{rs}$ to the total number of relevant songs $N_r$, and Precision is defined as the ratio of listened songs $N_{rs}$ to the number of recommended songs $N_s$.

$$\mathrm{Recall} = \frac{N_{rs}}{N_r}, \qquad \mathrm{Precision} = \frac{N_{rs}}{N_s}$$

F1-score combines Precision and Recall into a single metric:

$$F_1\text{-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

MAP is the mean of the average precision scores:

$$\mathrm{MAP} = \frac{\sum_{(u,p_u) \in D} AP(u, p_u)}{|D|}$$

where $AP(u, p_u)$ is the average precision of session $(u, p_u)$. As we only evaluate top-$k$ recommendation, these metrics are reported as Precision@k, Recall@k, F1-score@k and MAP@k for the top-$k$ ranked list.

5.3. Comparison and implementation

In order to show the improvement of our proposed model, we implement the following baselines for comparison with TME.

• Bigram Model (Bigram) [21] is a first-order Markov model. The main idea is to use the probability that pairs of songs appear in the training set to find classes that have some loose semantic coherence. Transition probabilities $\Pr(s_b \mid s_a)$ are estimated for every pair of songs. Similar to [24], we adopt the Witten–Bell smoothing technique to avoid minus-infinity log-likelihood contributions.
• Personalized Bigram Model (PBigram) [25] is an assembled algorithm which introduces a smoothed bigram $\Pr(s_b \mid u)$. Each transition from song $s_a$ to $s_b$ by user $u$ is scored by multiplying the two bigrams $\Pr(s_b \mid s_a)$ and $\Pr(s_b \mid u)$.
• Logistic Markov Embedding (LME) [24] is a method for learning a generative model of music playlists where playlists are treated as Markov chains. The method automatically embeds songs in Euclidean space and uses existing playlists as training data to learn the representation of every song in this space. LME is more accurate than the smoothed n-gram models, but it outputs the same transition for all users, ignoring personalized behavior.
• Personalized Markov Embedding (PME) [25] is a next-song recommendation model which extends the LME algorithm by distinguishing the actions of different users. It embeds both songs and users into a Euclidean space in which distances between songs and users reflect users' preference for songs. By combining short-term and long-term preferences, PME outperforms the non-personalized LME. But it ignores temporal dynamics in user preference.

The four models given above are Markov methods. Among them, Bigram and PBigram make predictions directly from the

probability statistics of transitions, but LME and PME are Markov Embedding-based methods. JAMA5 is an open matrix package for Java, developed at NIST (short for National Institute of Standards and Technology) and the University of Maryland. It provides the fundamental operations of numerical linear algebra, such as matrix addition and multiplication, matrix norms and selected element-by-element array operations, etc. All algorithms in our experiments are implemented using this library. 5.4. Parameter settings In this section, we analyze how the changes of the parameters affect the performance of our model. We aim to find the optimal model by applying different parameter configurations on the dataset. Regularization Parameters. kX ; kY , kZ and kT provide the control   of the range of values for vector xðsÞ; yðuÞ; z pu;j and yt ðuÞ in order to avoid overfitting. The parameters are tuned by cross validation through the grid f0:001; 0:01; 0:1; 1g. Table 3 shows the results of different parameter combinations. We observe that when using the second row setting, our model has best result for all the metrics. Dimension of the embedded space. As the dimension of vec  tors xðsÞ; yðuÞ; z pu;j and yt ðuÞ increases, the complexity of our model will certainly increase. We conduct experiments with d ranging from 1 to 20. The results of different values of d are shown in Fig. 3. We observe that the performance continues to improve with an increasing of d, but when d > 10, our model gets small performance improvement. Based on a balance between recommendation accuracy and computation cost, we choose 10 as the dimension of the embedded space in our experiments. Number of iterations. The stochastic gradient descent method based on iteration computation is time-consuming because each iteration needs to calculate gradient and update parameter. 
Running too many iterations does not necessarily bring a significant performance improvement, and overfitting becomes more likely as the number of iterations increases. Experiments with different numbers of iterations are shown in Fig. 4. We observe that Recall@10, Precision@10, F1-score@10 and MAP@10 decrease gradually as the number of iterations increases. It is better to run 60 iterations, because more than 60 iterations incur higher computational overhead without a large performance improvement in return.

5.5. Impact of parameters $T_B$ and $\lambda_B$

In Eq. (9), an additive bias $B_s(t)$ for every song $s$ extends the short-term effect to a time-changing effect, and $\lambda_B$ controls the influence of $B_s(t)$ on the short-term effect. $T_B$ in $B_s(t)$ (Eq. (8)) is the length of the time window before the current time $t$ over which we define a song's present popularity. We select the optimal values of $T_B$ and $\lambda_B$ by experiments. $T_B$ is varied from 1 day to 9 days. For each $T_B$, $\lambda_B$ is selected from the set {1, 0.5, 0.1, 0.05}. From Fig. 5, we observe that as $T_B$ increases, the performance improves gradually, but when $T_B$ surpasses 7 days, the performance begins to decrease. This phenomenon suggests that if $T_B$ is too small, $B_s(t)$ is not sufficient to represent song popularity, and if $T_B$ is too big, some earlier noise is introduced that weakens the temporal dynamics. Besides, our model performs best under the four metrics when $\lambda_B = 0.1$.

⁵ http://math.nist.gov/javanumerics/jama/.

Table 3
Regularization parameters $\lambda_X$, $\lambda_Y$, $\lambda_Z$ and $\lambda_T$ on Last.fm data (Top-10 recommendation, 10 dimensions and 60 iterations). Our model has the best results with the second-row setting.

$\lambda_X$   $\lambda_Y$   $\lambda_Z$   $\lambda_T$   Recall@10   Precision@10   F1-score@10   MAP@10
0.01          0.01          0.001         0.001         0.3421      0.0318         0.0582        0.0952
0.1           0.1           0.01          0.01          0.3451      0.0333         0.0607        0.0966
1             1             0.1           0.1           0.3403      0.0315         0.0576        0.0945
1             1             1             1             0.3886      0.0309         0.0572        0.0933

Fig. 3. Four metrics {Recall@10, Precision@10, F1-score@10 and MAP@10} vs. dimension (Top-10 recommendation, 60 iterations).

Fig. 4. Four metrics {Recall@10, Precision@10, F1-score@10 and MAP@10} vs. iteration number (Top-10 recommendation, 10 dimensions).

Fig. 5. Impact of different values of $T_B$ on song popularity. We use different $\lambda_B$ to control the influence of $B_s(t)$ (Top-10 recommendation, 10 dimensions).
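The parameter sweeps in Sections 5.4 and 5.5 amount to a grid search: every candidate setting is scored on held-out data and the best-scoring one is kept. A minimal Python sketch of that loop follows. It is not the authors' Java/JAMA code; the names `grid_search` and `toy_score` are hypothetical, and `toy_score` is a synthetic stand-in (arbitrarily shaped to peak at one grid point) for training TME and measuring a validation metric, so its values carry no experimental meaning.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Return (best_params, best_score) over the Cartesian product of `grid`.

    grid: dict mapping parameter name -> list of candidate values.
    score_fn: callable taking keyword arguments named after the grid keys
              and returning a number where larger is better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:  # strict comparison: ties keep the earliest setting
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(t_b_days, lambda_b):
    """Synthetic placeholder (an assumption, not a result from the paper).

    A real score_fn would train the model with these settings and return
    a validation metric such as Recall@10.
    """
    return -abs(t_b_days - 7) - abs(lambda_b - 0.1)


# Candidate values as described in Section 5.5: T_B in days and the lambda_B set.
grid = {"t_b_days": list(range(1, 10)), "lambda_b": [1, 0.5, 0.1, 0.05]}
best_params, best_score = grid_search(grid, toy_score)
```

Because each grid point requires a full training run, the cost grows with the product of the candidate-list sizes, which is one reason to keep each list short.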

5.6. Impact of parameters $T_U$ and $\lambda_U$

To model temporal dynamics in the long-term effect, we introduce the user drifting interest $y^t(u)$ into $\Pr(s_b \mid u)$ (Eq. (3)), which is extended into $\Pr(s_b \mid u, t)$ (Eq. (7)). $T_U$ controls the time distance before a transition $\Pr(s_b \mid s_a, t)$ and $\lambda_U$ controls how much our model should incorporate the temporal effect. To examine the impact of $T_U$ and $\lambda_U$ on the user's long-term effect, we conduct the following experiments. Similar to the experiments in Section 5.5, we vary $T_U$ from 1 month to 5 months. For each $T_U$, $\lambda_U$ is selected from the set {1, 0.1, 0.01, 0.001}. Fig. 6 compares the performance of our model for different $T_U$. We observe that the impact of $T_U$ generally follows the same trend as the impact of $T_B$. As $T_U$ increases, the four metrics increase at first, but when $T_U$ goes beyond a certain threshold, i.e., 2 months, the metrics decrease gradually. Our model performs best when $\lambda_U = 0.1$.

Fig. 6. Impact of different values of $T_U$ on user drifting preference. We use different $\lambda_U$ to control the influence of the drifting preference $y^t(u)$ (Top-10 recommendation, 10 dimensions).

5.7. Performance comparison

After the parameter setting and impact analysis, we obtain the optimal TME. In the following experiments, we compare our model with other state-of-the-art methods. Comparison experiments on Top-$k$ recommendation are shown in Fig. 7. Results show that as we increase the number of recommended songs, Recall increases and Precision, F1-score and MAP decrease. When $k > 10$, the curves of the four metrics tend to flatten out and the advantage of our model is much more obvious. In order to provide a direct comparison, the ranking accuracy of all the models on Top-10 recommendation is listed in Fig. 8. For Recall at 10, our model increases the accuracy by 19.08%, 16.31%, 7.07% and 2.92% over Bigram, PBigram, LME and PME, respectively. For Precision at 10, the corresponding improvements are 27.58%, 21.53%, 14.82% and 7.76%. For F1-score at 10, the corresponding improvements are 26.72%, 21.15%, 14.09% and 7.43%. For MAP at 10, the corresponding improvements are 18.38%, 15.27%, 11.29% and 7.81%.

Fig. 7. Top-$k$ comparisons of Bigram, PBigram, LME, PME and TME in 10 dimensions with $k \in [1, 20]$.

Fig. 8. Performance comparison on Top-10 Recommendations.

Finally, we evaluate how our model performs for different sessions. All sessions are grouped into 5 classes: "1–10", "10–20", "20–30", "30–40" and "≥40", denoting how many songs a session has in the training data. Fig. 9 summarizes the comparison results. We observe that our model performs better than the other methods for all the classes. Note that the four baselines remain relatively unchanged over the classes, but the prediction of our model becomes more accurate as the number of songs in the session increases. This phenomenon demonstrates that, within one session, as the user listens to more songs, our session drifting interest $y'(p_{u,j})$ can better capture the user's present interest.
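The four ranking metrics reported in Section 5.7 can be computed from a ranked recommendation list and a set of ground-truth songs. The sketch below uses common textbook definitions, which may differ in detail from the exact definitions used in this paper; the function names are hypothetical and each ranked list is assumed to contain no duplicates.

```python
def precision_recall_f1_at_k(ranked, relevant, k):
    """Precision@k, Recall@k and F1@k for one ranked list.

    ranked: list of item ids, best first. relevant: set of ground-truth ids.
    """
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_precision_at_k(ranked, relevant, k):
    """AP@k: precision at each rank (<= k) holding a relevant item, normalized."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def mean_average_precision_at_k(cases, k):
    """MAP@k: mean of AP@k over (ranked, relevant) test cases."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in cases) / len(cases)
```

Under these definitions, if every test case has exactly one true next song, Precision@k can never exceed 1/k, so Precision@10 values far below Recall@10 are expected rather than a sign of a weak model.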


Fig. 9. Performance comparison on different sessions (Top-10 Recommendations).
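The session classes compared in Fig. 9 can be formed by bucketing sessions on the number of songs they have in the training data. The class labels overlap at their boundaries (10 appears in both "1–10" and "10–20"), and the text does not say which class a boundary value belongs to. The sketch below therefore assumes half-open buckets [1, 10), [10, 20), [20, 30), [30, 40) and [40, ∞); both this convention and the function name `session_class` are assumptions made for illustration.

```python
def session_class(n_songs):
    """Map a session's song count to one of the five class labels.

    Assumption: boundaries are half-open, so a session with 10 songs
    falls into "10-20" and a session with 40 songs into ">=40".
    """
    if n_songs < 1:
        raise ValueError("a session must contain at least one song")
    if n_songs >= 40:
        return ">=40"
    if n_songs < 10:
        return "1-10"
    low = (n_songs // 10) * 10
    return f"{low}-{low + 10}"
```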

6. Conclusions

In this paper, we presented Time-based Markov Embedding (TME), a dynamic model for next-song music recommendation which can track the changes of user interest over time. Our data analysis showed that temporal dynamics exist in users, songs and sessions, respectively. In order to model these temporal dynamics, we take the long-term and short-term effects as well as the session-term effect into account. Our approach is to transform the three effects into time-changing effects by adding time factor terms. The three time-changing effects are then combined to identify the user's present interest. Experiments on Last.fm data demonstrate that TME can significantly increase the recommendation accuracy over state-of-the-art Markov methods. As future work, we would like to extend our model to capture more complicated temporal information. In addition, we plan to conduct online A/B-test experiments to investigate the effect of temporal dynamics on recommending songs when a user is browsing a music site.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant (No. 61472164) and the Technology Development Program of Shandong Province under Grant (No. 2011GGX10116).

Appendix A

Given one "local" log-likelihood term $l(D_{p_{u,c},s_a,s_b})$, we get the following derivatives of the term:

$$
\frac{\partial l(D_{p_{u,c},s_a,s_b})}{\partial x(s_a)}
= -2\vec{\Delta}(s_a,s_b)
- \frac{\sum_{j\in S} -2\vec{\Delta}(s_a,s_j)\, e^{-\Delta(s_a,s_j)^2+\lambda_B B_{s_j}(t)}}{\sum_{j\in S} e^{-\Delta(s_a,s_j)^2+\lambda_B B_{s_j}(t)}}
- 2\lambda_X x(s_a)
\tag{A.1}
$$

$$
\begin{aligned}
\frac{\partial l(D_{p_{u,c},s_a,s_b})}{\partial x(s_b)}
={}& 2\vec{\Delta}(u,s_b) + 2\lambda_U \vec{\Delta}(u^t,s_b)
- \frac{2\left[\vec{\Delta}(u,s_b)+\lambda_U \vec{\Delta}(u^t,s_b)\right] e^{-\Delta(u,s_b)^2-\lambda_U \Delta(u^t,s_b)^2}}{\sum_{j\in S} e^{-\Delta(u,s_j)^2-\lambda_U \Delta(u^t,s_j)^2}} \\
&+ 2\vec{\Delta}(s_a,s_b)
- \frac{2\vec{\Delta}(s_a,s_b)\, e^{-\Delta(s_a,s_b)^2+\lambda_B B_{s_b}(t)}}{\sum_{j\in S} e^{-\Delta(s_a,s_j)^2+\lambda_B B_{s_j}(t)}} \\
&+ 2\vec{\Delta}(p_{u,c},s_b) + 2\lambda_S \vec{\Delta}(u',s_b)
- \frac{2\left[\vec{\Delta}(p_{u,c},s_b)+\lambda_S \vec{\Delta}(u',s_b)\right] e^{-\Delta(p_{u,c},s_b)^2-\lambda_S \Delta(u',s_b)^2}}{\sum_{j\in S} e^{-\Delta(p_{u,c},s_j)^2-\lambda_S \Delta(u',s_j)^2}} \\
&- 2\lambda_X x(s_b)
\end{aligned}
\tag{A.2}
$$

$$
\frac{\partial l(D_{p_{u,c},s_a,s_b})}{\partial y(u)}
= -2\vec{\Delta}(u,s_b)
- \frac{\sum_{j\in S} -2\vec{\Delta}(u,s_j)\, e^{-\Delta(u,s_j)^2-\lambda_U \Delta(u^t,s_j)^2}}{\sum_{j\in S} e^{-\Delta(u,s_j)^2-\lambda_U \Delta(u^t,s_j)^2}}
- 2\lambda_Y y(u)
\tag{A.3}
$$

$$
\frac{\partial l(D_{p_{u,c},s_a,s_b})}{\partial z(p_{u,c})}
= -2\vec{\Delta}(p_{u,c},s_b)
- \frac{\sum_{j\in S} -2\vec{\Delta}(p_{u,c},s_j)\, e^{-\Delta(p_{u,c},s_j)^2-\lambda_S \Delta(u',s_j)^2}}{\sum_{j\in S} e^{-\Delta(p_{u,c},s_j)^2-\lambda_S \Delta(u',s_j)^2}}
- 2\lambda_Z z(p_{u,c})
\tag{A.4}
$$

$$
\frac{\partial l(D_{p_{u,c},s_a,s_b})}{\partial y^t(u)}
= -2\lambda_U \vec{\Delta}(u^t,s_b)
- \frac{\sum_{j\in S} -2\lambda_U \vec{\Delta}(u^t,s_j)\, e^{-\Delta(u,s_j)^2-\lambda_U \Delta(u^t,s_j)^2}}{\sum_{j\in S} e^{-\Delta(u,s_j)^2-\lambda_U \Delta(u^t,s_j)^2}}
- 2\lambda_T y^t(u)
\tag{A.5}
$$

where $\vec{\Delta}(s_a,s_b)$ denotes the vector $x(s_a)-x(s_b)$; $\vec{\Delta}(u,s)$ denotes the vector $y(u)-x(s)$; $\vec{\Delta}(p_{u,c},s)$ denotes the vector $z(p_{u,c})-x(s)$ and $\vec{\Delta}(u^t,s)$ denotes the vector $y^t(u)-x(s)$.
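A standard way to sanity-check a hand-derived gradient such as the derivative with respect to $y(u)$ (Eq. (A.3)) is to compare it against central finite differences. The sketch below does this in plain Python for the user-dependent factor only. It assumes that this factor has the softmax form $\exp(-\Delta(u,s_b)^2 - \lambda_U \Delta(u^t,s_b)^2)$ normalized over all songs, with an L2 penalty $\lambda_Y \lVert y(u) \rVert^2$, as the shape of Eq. (A.3) suggests. All function names and the tiny numeric instance are made up for illustration; this is not the authors' implementation.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def user_term(y, y_t, songs, b, lam_u, lam_y):
    """User-dependent part of the local log-likelihood, with L2 penalty on y."""
    scores = [-sq_dist(y, x) - lam_u * sq_dist(y_t, x) for x in songs]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return scores[b] - log_z - lam_y * sum(v * v for v in y)


def user_grad(y, y_t, songs, b, lam_u, lam_y):
    """Analytic gradient with respect to y(u), following the form of Eq. (A.3)."""
    weights = [math.exp(-sq_dist(y, x) - lam_u * sq_dist(y_t, x)) for x in songs]
    z = sum(weights)
    grad = []
    for d in range(len(y)):
        observed = -2.0 * (y[d] - songs[b][d])
        expected = sum(-2.0 * (y[d] - x[d]) * w for x, w in zip(songs, weights)) / z
        grad.append(observed - expected - 2.0 * lam_y * y[d])
    return grad


def numeric_grad(f, y, eps=1e-6):
    """Central finite differences of f at y."""
    grad = []
    for d in range(len(y)):
        up, down = list(y), list(y)
        up[d] += eps
        down[d] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


# Tiny made-up instance: three songs in a 2-dimensional embedding space.
songs = [[0.1, -0.3], [0.7, 0.2], [-0.4, 0.5]]
y, y_t = [0.2, 0.1], [0.0, 0.3]
analytic = user_grad(y, y_t, songs, b=1, lam_u=0.1, lam_y=0.01)
numeric = numeric_grad(lambda v: user_term(v, y_t, songs, 1, 0.1, 0.01), y)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

A small `max_gap` indicates that the analytic expression and the objective agree; the same check can be repeated for the other parameter vectors before running stochastic gradient descent.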
