Accepted Manuscript
Fast Recommendation on Latent Collaborative Relations

Chien-Liang Liu, Xuan-Wei Wu

PII: S0950-7051(16)30181-2
DOI: 10.1016/j.knosys.2016.06.016
Reference: KNOSYS 3571
To appear in: Knowledge-Based Systems
Received date: 28 January 2016
Revised date: 1 June 2016
Accepted date: 13 June 2016
Please cite this article as: Chien-Liang Liu, Xuan-Wei Wu, Fast Recommendation on Latent Collaborative Relations, Knowledge-Based Systems (2016), doi: 10.1016/j.knosys.2016.06.016
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights

• Devise a recommendation algorithm based on a latent factor model.
• Combine latent factors and the $\ell_2$ norm to formulate the recommendation problem as a k-nearest-neighbor problem, in which we further use locality-sensitive hashing (LSH) to reduce search time complexity.
• Speed up retrieval by 5x to 313x on the three data sets used in the experiments.
Fast Recommendation on Latent Collaborative Relations

Chien-Liang Liu a,∗, Xuan-Wei Wu b

a Department of Industrial Engineering and Management, National Chiao Tung University, 1001 University Road, Hsinchu 300, Taiwan (R.O.C.)
b Department of Computer Science, National Chiao Tung University, 1001 University Road, Hsinchu 300, Taiwan (R.O.C.)
Abstract
One important property of collaborative filtering recommender systems is that popular items are recommended disproportionately often, because they provide extensive usage data and thus can be recommended to more users. Yet niche products can be as economically attractive as mainstream fare for online retailers, who can stock virtually everything; the number of available niche products exceeds the hits by several orders of magnitude. This work addresses accuracy, coverage and prediction time issues to propose a novel latent factor model called latent collaborative relations (LCR), which transforms the recommendation problem into a nearest neighbor search problem by using the proposed scoring function. We project users and items into a latent space, and calculate their similarities based on the Euclidean metric. Additionally, the proposed model provides an elegant way to incorporate locality-sensitive hashing (LSH) to provide fast recommendation while retaining recommendation accuracy and coverage. The experimental results indicate that the speedup is significant, especially when one is confronted with large-scale data sets. As for recommendation accuracy and coverage, the proposed method is competitive on three data sets.
Keywords: Recommender Systems, Latent Factor Model, Locality-sensitive Hashing, Nearest Neighbors
1. Introduction
With the rise of the Internet, the last decade has witnessed the great success of recommender systems. Recommender systems serve two important functions. First, they can significantly help users find relevant and interesting items in the information era. Second, they help businesses make more profits, since recommender systems give niche items more chances to be bought by customers. Much work has been done in both industry and academia on developing new approaches to recommender systems. Recommender systems can significantly help users
∗ Corresponding author.
Email addresses: [email protected] (Chien-Liang Liu), [email protected] (Xuan-Wei Wu)
find relevant and interesting items in the information era, so they have been applied to many application domains, including, but not limited to, telecom businesses [1], e-government services [2, 3] and online shopping [4]. The collaborative filtering (CF) recommender system is considered to be one of the most successful approaches, and many CF algorithms have been proposed over the last decades [5, 6, 7]. It recommends to the active user the items that other users with similar tastes liked in the past. However, while the majority of algorithms proposed in the recommender systems literature have focused on improving recommendation accuracy, other important aspects of recommendation quality, such as the coverage of recommendations, have been overlooked [8, 9].

One important property of CF recommender systems is that popular items are recommended disproportionately often, because they provide extensive usage data and thus can be recommended to more users [10, 11, 12]. Traditional retail tends to stock popular products, since shelf space is expensive. Compared to popular products, niches can be as economically attractive as mainstream fare for online retailers. Online retailers can stock virtually everything, and the number of available niche products exceeds the hits by several orders of magnitude. Coverage concerns the degree to which recommendations cover the set of available items and the degree to which recommendations can be generated for all potential users. A recommender system with high coverage presents to the end user a more detailed investigation of the product space and benefits from the long tail phenomenon, which has gained popularity in recent times as describing the retailing strategy of selling a large number of unique items in relatively small quantities [13, 14]. Therefore, coverage is another indicator of quality. Recommendation accuracy alone is insufficient to satisfy user expectations, since other characteristics, such as coverage, novelty, serendipity and trust, must be considered as well [8, 9].

Besides, most academic recommender systems are based on offline methods that require costly recalculations over all the observed data [15], and critical problems concerning computational cost emerge in practical situations. Building large-scale recommender systems requires considering practical issues such as prediction time and scalability. For example, real-world online recommender systems must adhere to strict response-time constraints. Consequently, the motivation of this work is to devise a fast recommendation algorithm that considers recommendation accuracy and coverage in the model.

This work focuses on training to optimize retrieval of the top k items and addresses accuracy, coverage and prediction time issues to propose a novel algorithm called latent collaborative relations (LCR). Latent factor models have become popular in recent years [16, 17, 18], and most previous latent factor models use the inner product as the metric to calculate item similarities. Then, one can retrieve the most suitable items by scoring and sorting all items. However, when the number of items is enormous, scoring all items becomes intractable. Several research studies used approximate matches to speed up the retrieval. For example, Bachrach et al. [19] proposed an order-preserving transformation to map the maximum inner product search problem to a Euclidean-space nearest neighbor search problem.
Then, the item with the smallest Euclidean distance in the transformed space is retrieved based on nearest neighbor search with the PCA-Tree [20] data structure. Koenigstein et al. [21] proposed to use spherical clustering to index users on the basis of their preferences,
and to pre-compute recommendations only for the representative user of each cluster to obtain extremely efficient approximate solutions. This work transforms the recommendation problem into a nearest neighbor search problem under the $\ell_2$ norm, and integrates multi-probe locality-sensitive hashing (LSH) [22] into the model to perform approximate search, which can speed up retrieval by 5x to 313x on the three data sets used in the experiments.

The contributions and innovation of this work are listed as follows. First, this work combines latent factors and the $\ell_2$ norm to formulate the recommendation problem as a k-nearest-neighbor (kNN) problem, and proposes a novel recommendation algorithm, which uses locality-sensitive hashing (LSH) [23] to reduce search time complexity. The prediction guarantees sub-linear time retrieval with a bounded approximation error. Second, this work uses Euclidean distance to measure the similarity of two vectors, and we empirically show that calculating the distance in Euclidean space achieves a higher coverage than using inner products. Finally, the proposed algorithm can incorporate various features into the model to attain a higher recommendation accuracy and tackle cold-start problems. In summary, this work considers recommendation accuracy and practical deployment issues to devise a novel algorithm, which gives competitive performance while considering prediction time and long tail issues. In the experiments, we compare the proposed algorithm with several algorithms on three data sets. The experimental results indicate that the proposed algorithm is competitive on recommendation accuracy and coverage.

The rest of this paper is organized as follows. Section 2 presents related surveys of latent factor models and nearest neighbor search. Section 3 introduces the proposed latent collaborative relations algorithm. Next, Section 4 summarizes the results of several experiments. Conclusions are finally drawn in Section 5.
2. Related Work
The proposed method is based on latent factors and nearest neighbor search, so the literature survey focuses on these two research topics.
2.1. Latent Factor Models

CF is perhaps one of the most well-studied recommendation approaches, since it predicts the utility of items for a particular user based on the items previously rated by other users [24]. The two most successful approaches to CF are neighborhood models and latent factor models [5]. The neighborhood models comprise user-based [25] and item-based approaches [26, 4, 27], in which a rating is estimated based on the relationships between users and between items, respectively. Koren [16] demonstrated that using low-rank matrix factorization, which is a latent factor model, is superior to classic nearest-neighbor techniques for product recommendations. The success of the Netflix competition makes matrix factorization [16] one of the most popular collaborative filtering methods. The idea behind matrix factorization models is to map users and items to a latent space, and calculate the similarities with inner products. Latent factor models have become popular in recent years, since they can model interactions between variables and obtain good scalability and high predictive accuracy even in problems
with huge sparsity [17]. The success of MF inspired many research studies to use latent factors to devise algorithms. Rendle combined the advantages of support vector machines (SVM) with factorization models to propose an algorithm called factorization machines (FM) [17, 18]. Chen et al. [28] observed that the majority of matrix factorization models share common patterns, so they proposed a framework called feature-based matrix factorization, which can incorporate various features into the model.

Root mean squared error (RMSE) is probably the most popular metric used in evaluating the accuracy of predicted ratings. However, recent research studies have shown that ranking metrics are better evaluation metrics [29, 30, 31], since most recommendation applications present the user a list of recommendations, typically as a vertical or horizontal list, imposing a certain natural browsing order. Therefore, recent work on recommendation has tended to use ranking approaches to devise algorithms. Rendle et al. [30] proposed a pairwise ranking algorithm called Bayesian Personalized Ranking (BPR), which is the maximum posterior estimator derived from a Bayesian analysis of the problem. The proposed method also uses a maximum posterior estimator, but several differences exist between the two methods. First, the purposes of the two methods are totally different. BPR focuses on presenting a generic optimization criterion for personalized ranking, while the proposed method addresses accuracy, coverage and prediction time issues to transform the recommendation problem into a nearest neighbor search problem by using the proposed scoring function. Second, BPR uses the user-specific order of two items as the training data for pairwise ranking, while the proposed method uses the global ordering of two instances. Finally, the proposed method can incorporate LSH into the proposed model to provide fast recommendation. Gantner et al. [32] proposed an adapted version of BPR to focus on a specific problem in which the candidate items are sampled from a given distribution. The adapted BPR takes the non-uniform sampling of negative test items into account rather than the uniform sampling used by BPR. Liu et al. [33] proposed an algorithm called list-wise probabilistic matrix factorization (ListPMF), which maximizes the log-posterior over the predicted preference order with the observed preference orders.

Recommendation data can be represented as relational data, which comprises entities and relations between them. In the movie recommendation problem, the entities involve users and movies, while relations encode users' ratings of movies. Relational learning is concerned with predicting unknown values of a relation. Singh and Gordon [34] focused on relational learning and proposed a collective matrix factorization algorithm, which simultaneously factorizes several matrices and shares parameters among factors. The advent and popularity of social networks inspire many research studies to consider social factors in recommender systems [35, 36]. For example, Qian et al. [35] fused three social factors, including personal interest, inter-personal interest similarity, and inter-personal influence, into a unified personalized recommendation model based on probabilistic matrix factorization. Meanwhile, Jiang et al. [36] fused individual preference and interpersonal influence in latent space.
2.2. Nearest Neighbor Search

Nearest neighbor search over massive data sets is a fundamental problem in data mining, and it arises in numerous application fields such as computer vision, plagiarism detection, and recommender systems. Given a set $S$ of points in a space $M$ and a query point $q \in M$, the goal is to find the closest point in $S$ to $q$. A generalization of this problem is to find the $k$ closest points, namely kNN search. A naive approach is to use linear search to compute the distance from the query point to every other point, resulting in a running time of $O(Nd)$, where $N$ is the cardinality of $S$ and $d$ is the dimensionality of $M$. This is very inefficient for massive data sets, so many methods have been developed to reduce the time complexity of nearest neighbor search over the past decades. Several research studies tackle this problem by designing efficient data structures to find exact nearest neighbors [37, 38, 39]. The k-d tree [37] and its variants are probably the most popular data structures used for searching in multidimensional spaces. A review of nearest neighbor data structures from the database perspective can be found in [40]. However, these algorithms become inefficient as the dimension grows.

Recently, approximate nearest neighbor search algorithms have become increasingly important, especially for high-dimensional data sets [23, 41, 42]. In this formulation, the algorithm is allowed to return a point whose distance from the query is at most $c$ times the distance from the query to its nearest points, where $c > 1$ is called the approximation factor [43]. Locality-sensitive hashing (LSH) [23] is one of the most promising approximate algorithms, and has attracted extensive research interest in the past years. Let $\mathcal{H}$ be a family of hash functions, each of which maps $\mathbb{R}^d$ to some universe $U$. The family $\mathcal{H}$ is called locality sensitive (with proper parameters) if it satisfies the following condition [43].
Definition 2.1 (Locality-Sensitive Hashing). A family $\mathcal{H}$ is called $(R, cR, P_1, P_2)$-sensitive if for any two points $p, q \in \mathbb{R}^d$:

• if $\|p - q\| \leq R$, then $P[h(q) = h(p)] \geq P_1$
• if $\|p - q\| \geq cR$, then $P[h(q) = h(p)] \leq P_2$
To use LSH for approximate nearest neighbor search, we pick $c > 1$ and $P_1 > P_2$. The key idea of LSH is to hash the points using several hash functions to ensure that similar items are mapped to the same buckets with high probability. Given a query point $q$, one can determine near neighbors by hashing $q$ and retrieving the points stored in buckets containing $q$. Note that the hashing used in LSH differs from conventional hash functions, which try to avoid collisions; LSH aims to maximize collisions for similar items. Several LSH families for different distance functions have been discovered, including Hamming distance, Jaccard, and the $\ell_1$ and $\ell_2$ distances [23]. Datar et al. [44] further proposed LSH families for $\ell_p$ norms based on $p$-stable distributions [45]. LSH requires a large number of hash tables to achieve good search quality. Lv et al. [22] proposed a new indexing scheme called
multi-probe LSH to overcome this problem by intelligently probing multiple buckets that are likely to contain query results in a hash table. Many applications rely on measuring similarity between data points, and hashing techniques have been widely adopted for approximate nearest neighbor search on large-scale datasets. For example, similarity learning is concerned with binary relationships in many application settings, namely whether two data points are similar or dissimilar, and this problem can be addressed by similarity-sensitive hashing [46, 47]. A typical application is content-based image retrieval, whose goal is to find the images in the database similar to the query image. In the multimedia domain, an object often appears in several different views. Ding et al. [48] focused on multi-view problems and proposed a hashing algorithm called Collective Matrix Factorization Hashing (CMFH), which assumes that each view of one instance generates identical hash codes.

3. Latent Collaborative Relations
In the collaborative filtering setting, there exist users $U = \{u^{(1)}, \ldots, u^{(|U|)}\}$ and items $I = \{i^{(1)}, \ldots, i^{(|I|)}\}$, and data instances are represented as input features $X = \{x = (u, i) \mid u \in U, i \in I\}$ and their corresponding labels $Y = \{y_{ui}\}$. This work focuses on implicit information, namely $y_{ui} \in \{0, 1\}$ is one for the positive class (e.g., clicked, watched movies, played songs, purchased or assigned tags) and zero for the negative class. The proposed method also provides the flexibility for the model to use various features, including collaborative filtering and content-based features.
3.1. Feature Representation

As for content-based features, we introduce $\alpha \in A(u)$ and $\beta \in A(i)$ to represent the feature sets for a user $u$ and an item $i$, respectively. For example, $A(u)$ can comprise a user profile such as gender and age, while $A(i)$ can consist of item features such as item categories and price. The standard collaborative filtering setting is a special case of the proposed model in which $A(u)$ only contains the user id and $A(i)$ only comprises the item id. Moreover, this work introduces $x_\nu$ to represent the value of the user (or item) feature $\nu$, and uses different schemes to represent $x_\nu$ for different cases. Features whose values are real numbers are left unchanged. A categorical feature with $m$ categories is represented with $m$ variables, each a binary indicator. Therefore, the gender feature is encoded as two variables, $\nu_1$ for "gender is male" and $\nu_2$ for "gender is female"; if $x_{\nu_1} = 0$ and $x_{\nu_2} = 1$, then the user is female. Additionally, categorical variables can belong to several categories at the same time, and we use a normalization technique to encode them. For example, let $\nu_1$ and $\nu_2$ denote the animation and comedy genres, respectively. If the genres of a movie include animation and comedy, its $x_{\nu_1}$ and $x_{\nu_2}$ are both 0.5, and the other genre variables are zeros. Note that we normalize the values so that their sum is one.
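As a minimal sketch of the feature encoding just described, the following Python snippet maps raw attributes to the $x_\nu$ values; the function name, argument layout and dictionary-based feature representation are our own assumptions, not part of the paper.

```python
def encode_features(real_feats, categorical_feats, multi_feats):
    """Encode raw user (or item) attributes into the x_nu values of
    Section 3.1. A sketch; all names here are hypothetical."""
    x = {}
    for name, value in real_feats.items():
        x[name] = float(value)                  # real-valued features are unchanged
    for name, category in categorical_feats.items():
        x[f"{name}={category}"] = 1.0           # one binary indicator per category
    for name, categories in multi_feats.items():
        weight = 1.0 / len(categories)          # normalize so the values sum to one
        for category in categories:
            x[f"{name}={category}"] = weight
    return x

# A movie tagged as both animation and comedy: each genre variable becomes 0.5.
print(encode_features({}, {}, {"genre": ["animation", "comedy"]}))
# {'genre=animation': 0.5, 'genre=comedy': 0.5}
```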
3.2. Scoring Function

The proposed method is a latent factor model, and all of the features can be mapped to a latent space with dimension $\omega$. This work associates the user feature $\alpha$ with a latent factor vector $w_\alpha \in \mathbb{R}^\omega$. The same association applies to item features, namely $w_\beta \in \mathbb{R}^\omega$ is the latent factor vector for the item feature $\beta$. Hence, each user and item can be considered as a combination of latent vectors, and we use the center of mass of the latent vectors to denote a user or item. Most previous research studies used the inner product to calculate the similarity between user and item. However, the result of a Euclidean embedding is more intuitively understandable for humans, and the neighborhood structure of the unified Euclidean space allows very efficient recommendation queries [49], so this work uses Euclidean distance to measure the similarity of two vectors. We define a scoring function for a given item $i$ with respect to a given user $u$ as presented in Equation (1):

$$f_d(x) = f_d(u, i) = -(w_u - w_i)^T (w_u - w_i) \tag{1}$$

where

$$w_u = \frac{1}{\sum_{\nu \in A(u)} x_\nu} \sum_{\alpha \in A(u)} x_\alpha w_\alpha, \qquad w_i = \frac{1}{\sum_{\nu \in A(i)} x_\nu} \sum_{\beta \in A(i)} x_\beta w_\beta$$
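The snippet below is a minimal sketch of Equation (1): it computes the centers of mass $w_u$ and $w_i$ from feature dictionaries (as produced by the encoding sketch above) and scores a user-item pair. The latent-factor dictionary and all names are illustrative assumptions.

```python
import numpy as np

def center_of_mass(x, latent):
    """w_u (or w_i) in Equation (1): the normalized weighted sum of the
    latent vectors of the active features."""
    return sum(v * latent[f] for f, v in x.items()) / sum(x.values())

def score(x_user, x_item, latent):
    """f_d(u, i) = -(w_u - w_i)^T (w_u - w_i): the negative squared
    Euclidean distance, so nearer items receive higher scores."""
    diff = center_of_mass(x_user, latent) - center_of_mass(x_item, latent)
    return -float(diff @ diff)

# Toy usage with omega = 2 and hypothetical feature names.
latent = {"user=42":      np.array([0.1, 0.9]),
          "item=7":       np.array([0.3, 0.7]),
          "genre=comedy": np.array([0.1, 1.1])}
x_u = {"user=42": 1.0}
x_i = {"item=7": 0.5, "genre=comedy": 0.5}   # item id plus one genre feature
print(score(x_u, x_i, latent))               # -0.01 for this toy example
```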
3.3. Pairwise Loss Function

This work models the problem using a maximum a posteriori (MAP) estimator. Let $\theta \in \mathbb{R}$ be a component of a certain feature's latent factor vector, i.e., if $\theta$ is the $f$-th component of $w_\cdot$, then $\theta$ is $w_{\cdot,f}$, and let $\Theta$ denote the collection of all components of the features' latent factor vectors, i.e., $\Theta = \{w_{l,f} \mid l \in A(u) \cup A(i) \text{ and } f = 1, 2, \ldots, \omega\}$. Assume each parameter $\theta$ in $\Theta$ is independent of the others and follows a Gaussian distribution $\mathcal{N}(0, \lambda_\theta^{-1})$. Then, the prior of $\Theta$ is

$$p(\Theta) = \prod_{\theta \in \Theta} \frac{1}{\sqrt{2\pi \lambda_\theta^{-1}}} \, e^{-\frac{\lambda_\theta \theta^2}{2}} \tag{2}$$
Explicit feedback such as ratings is unavailable in most application settings, but it is much easier to collect implicit feedback such as clicks, view times, and purchases from websites. Therefore, this work focuses on implicit feedback. An interesting aspect of implicit feedback systems is that only positive observations are available, and non-observed data instances are a mixture of negative feedback and missing values. It is difficult to differentiate real negative feedback from missing values, so this work simplifies the task by categorizing the data into two classes, positive and negative: data instances that do not belong to the positive class are viewed as negative instances. Let $X^{(+)}$ denote the set of positive instances and $X^{(-)}$ the set of negative instances. Additionally, assuming the pairs of data are independent and identically distributed (i.i.d.), the log-likelihood of $\Theta$ conditional on the observed data is
$$\ell(\Theta) = \sum_{(x^{(+)}, x^{(-)}) \in X^2} \ln p(x^{(+)} \succ x^{(-)} \mid \Theta) \tag{3}$$
where $X^2 = X^{(+)} \times X^{(-)}$ and $p(x^{(+)} \succ x^{(-)} \mid \Theta)$ is the probability that the data instance $x^{(+)}$ is ranked before $x^{(-)}$ given the model parameters $\Theta$. In the recommendation setting, the goal is to rank positive examples before negative ones. Thus, we further use the scoring function listed in Equation (1) and a sigmoid function $\sigma$ to transform the pairwise relationship between positive and negative examples into a classification problem. Equation (4) shows the formula, in which the output is a probability value between zero and one: it approaches one if $f_d(x^{(+)})$ is much larger than $f_d(x^{(-)})$, and zero if $f_d(x^{(+)})$ is much smaller than $f_d(x^{(-)})$.
$$p(x^{(+)} \succ x^{(-)} \mid \Theta) = \sigma\!\left(f_d(x^{(+)}) - f_d(x^{(-)})\right) = \frac{1}{1 + e^{-(f_d(x^{(+)}) - f_d(x^{(-)}))}} \tag{4}$$
Combining the prior listed in Equation (2) with the likelihood listed in Equation (3), one can obtain the loss function presented in Equation (5), which is the negative of the MAP estimator:

$$-\!\!\sum_{(x^{(+)}, x^{(-)}) \in X^2}\!\! \ln p(x^{(+)} \succ x^{(-)} \mid \Theta) + \frac{\lambda_\Theta}{2} \sum_{\theta \in \Theta} \theta^2 = -\!\!\sum_{(x^{(+)}, x^{(-)}) \in X^2}\!\! \ln \sigma\!\left(f_d(x^{(+)}) - f_d(x^{(-)})\right) + \frac{\lambda_\Theta}{2} \sum_{\theta \in \Theta} \theta^2 \tag{5}$$
Each parameter $\lambda_\theta$ can be treated as a hyperparameter, which can be tuned with a holdout method. In practice, one need not give every parameter its own hyperparameter: the proposed model groups parameters and lets a regularization parameter $\lambda_\Theta \in \mathbb{R}^+$ be shared among the parameters in a group.
3.4. Learning Algorithm

Using batch learning algorithms such as gradient descent to minimize Equation (5) is infeasible when one is confronted with large-scale data sets. The main reason is that Equation (5) involves summing over all examples. Moreover, it is unreasonable to assume that the data set is fixed, since the data set grows all the time. Besides computational complexity, using batch learning approaches to design recommendation algorithms generally suffers from the model update problem, which is why this work proposes to minimize the loss function using stochastic gradient descent, which performs a gradient update for just one example at a time. Thus, instead of optimizing Equation (5), the objective function of this work is listed in Equation (6), where we sample a positive instance $x^{(+)}$ and a negative instance $x^{(-)}$ from the data set at each iteration.
$$J(\Theta) = -\ln \sigma\!\left(f_d(x^{(+)}) - f_d(x^{(-)})\right) + \frac{\lambda_\Theta}{2} \sum_{\theta \in \Theta} \theta^2 \tag{6}$$
With the loss function listed in Equation (6) and the proposed scoring function, we can derive the update rule for a latent factor vector $w$ as listed in Equation (7), where $\eta$ is the learning rate. The partial derivative of the loss function can be further expressed as Equation (8), in which the partial derivatives of the scoring function with respect to the parameters are listed in Equation (9). Finally, learning the model parameters of LCR is done by looping over Equation (7).

$$w = w - \eta \frac{\partial}{\partial w} J(\Theta) \tag{7}$$

$$\frac{\partial}{\partial w} J(\Theta) = -\frac{\frac{\partial}{\partial w} f_d(x^{(+)}) - \frac{\partial}{\partial w} f_d(x^{(-)})}{1 + e^{f_d(x^{(+)}) - f_d(x^{(-)})}} + \lambda_\Theta w \tag{8}$$

$$\frac{\partial}{\partial w} f_d(x) = \begin{cases} \dfrac{-2 x_\alpha (w_u - w_i)}{\sum_{\nu \in A(u)} x_\nu} & \text{if } w \text{ is a user latent factor vector} \\[2ex] \dfrac{2 x_\beta (w_u - w_i)}{\sum_{\nu \in A(i)} x_\nu} & \text{if } w \text{ is an item latent factor vector} \end{cases} \tag{9}$$
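The following is a sketch of Equation (9) in the same style as the scoring snippet of Section 3.2 (reusing its center_of_mass helper); multiplying these per-feature gradients by the scalar in Equation (8) and adding the regularization term yields the full gradient of $J(\Theta)$. The function name is our own.

```python
def score_gradients(x_user, x_item, latent):
    """Per-feature gradients of f_d(x) from Equation (9)."""
    w_u = center_of_mass(x_user, latent)
    w_i = center_of_mass(x_item, latent)
    diff = w_u - w_i
    s_u, s_i = sum(x_user.values()), sum(x_item.values())
    grads = {f: -2.0 * v * diff / s_u for f, v in x_user.items()}  # user side
    for f, v in x_item.items():                                    # item side
        grads[f] = grads.get(f, 0.0) + 2.0 * v * diff / s_i
    return grads
```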
The complexity of the update in Equation (7) depends on the partial derivative of the cost function $J(\Theta)$ in Equation (8), which involves the scoring function $f_d(x)$ in Equation (1) and the partial derivatives of the scoring function in Equation (9). The time complexities of Equation (1) and Equation (9) both depend on $w_u$, $w_i$, and $w_u - w_i$, so the time complexity of the update in Equation (7) is $O(\omega(|A(u)| + |A(i)|))$.

Besides the update rules listed above, this work further discusses implementation issues that can make the learning algorithm more efficient. We can further analyze the partial derivative listed in Equation (8), which has a multiplicative scalar, namely $\frac{1}{1 + e^{f_d(x^{(+)}) - f_d(x^{(-)})}}$. Here, we introduce a variable $\delta(x^{(+)}, x^{(-)})$, listed below, to represent this scalar; it has two useful properties.

$$\delta(x^{(+)}, x^{(-)}) = \frac{1}{1 + e^{f_d(x^{(+)}) - f_d(x^{(-)})}} = 1 - \sigma\!\left(f_d(x^{(+)}) - f_d(x^{(-)})\right)$$
First, the value of $\delta(x^{(+)}, x^{(-)})$ can be interpreted as the influence of a specific data pair $(x^{(+)}, x^{(-)})$ on the update of the parameters $\Theta$: it approaches zero if the ranking is correct and the positive instance is assigned a larger score than the negative instance, and it approaches one in the opposite case. Second, the output of $\delta(x^{(+)}, x^{(-)})$ is also a probability value. Based on these two properties, we design a sampling process that helps the learning process obtain data pairs of good quality. Instead of drawing data pairs uniformly and accepting all drawn pairs to update the model parameters, the implementation of this work samples data instances according to the value of $\delta(x^{(+)}, x^{(-)})$, namely, it draws $(x^{(+)}, x^{(-)}) \in X^2$ with probability proportional to $\delta(x^{(+)}, x^{(-)})$. Note that $\delta(x^{(+)}, x^{(-)})$ is not a fixed scalar value, but depends on the model parameters $\Theta$.

The proposed algorithm is presented in Algorithm 1, which comprises two parts. The first part samples a pair of instances. The sampling process is listed in Lines 3-4, in which we draw $x^{(+)}$ and $x^{(-)}$ from $X^{(+)}$ and $X^{(-)}$, respectively. The initialization of the latent factor vectors is completed in Lines 5-9. Then, we use the sampling technique mentioned above to obtain data pairs: we introduce a random number $r$, sampled from a uniform distribution on the interval [0, 1], to determine whether to accept the drawn data pair $(x^{(+)}, x^{(-)})$ (Lines 10-12). The second part updates the model parameters (Lines 13-15), in which we take a gradient step to minimize the pairwise loss.
Algorithm 1: Latent Collaborative Relations Algorithm
Input: X(+), X(-), η, λΘ, Θ
Output: Θ
1:  repeat
2:      repeat
3:          Draw x(+) ∈ X(+) uniformly
4:          Draw x(-) ∈ X(-) uniformly
5:          foreach latent factor w of x(+) and x(-) do
6:              if w is not initialized then
7:                  Initialize w
8:              end
9:          end
10:         δ(x(+), x(-)) = 1 − σ(fd(x(+)) − fd(x(-)))
11:         Draw r ∼ unif(0, 1)
12:     until r < δ(x(+), x(-))
13:     foreach latent factor w of x(+) and x(-) do
14:         w = w + η(δ(x(+), x(-)) · ∂/∂w(fd(x(+)) − fd(x(-))) − λΘ w)
15:     end
16: until convergence
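The following is a minimal sketch of one outer iteration of Algorithm 1, reusing the score and score_gradients sketches above; the sampler callbacks, initialization scale and default hyperparameters are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lcr_step(draw_pos, draw_neg, latent, eta=0.05, lam=0.01, omega=10):
    """One outer iteration of Algorithm 1. draw_pos/draw_neg return a
    (x_user, x_item) pair of feature dictionaries."""
    while True:                                            # Lines 2-12
        pos, neg = draw_pos(), draw_neg()
        for f in set(pos[0]) | set(pos[1]) | set(neg[0]) | set(neg[1]):
            if f not in latent:                            # Lines 5-9
                latent[f] = 0.01 * rng.standard_normal(omega)
        delta = 1.0 - sigmoid(score(*pos, latent) - score(*neg, latent))
        if rng.random() < delta:                           # accept w.p. delta
            break
    g_pos = score_gradients(*pos, latent)                  # Lines 13-15
    g_neg = score_gradients(*neg, latent)
    for f in set(g_pos) | set(g_neg):
        step = g_pos.get(f, 0.0) - g_neg.get(f, 0.0)
        latent[f] += eta * (delta * step - lam * latent[f])
```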
3.5. Fast Approximate Recommendation Generation by Multi-Probe LSH

In a ranking-based recommendation setting, traditional latent factor models require two steps to obtain the top k relevant items for an active user. First, project users and items into a latent space, and calculate their similarities in the latent space; the most commonly used similarity metric between a user vector and an item vector is the inner product. Second, sort the scores to obtain the top k items. This approach suffers from a computation problem when one is confronted with a large-scale item set.

This work transforms the recommendation problem into a nearest neighbor search problem: a user latent vector serves as the query point, and all the item latent factor vectors are data points. The goal is to search for the data points that are close to the query point. The straightforward approach is to compute all the pairwise distances between the query point and the data points, and sort the distances to obtain the nearest ones; this is clearly infeasible for massive data sets. The kd-tree provides an efficient alternative to speed up the retrieval, but it is not suitable for efficiently finding nearest neighbors in high-dimensional spaces. To provide efficient recommendation retrieval, this work provides the flexibility to incorporate an efficient approximate nearest neighbor search algorithm, LSH, into the proposed latent collaborative relations (LCR) model.

Although the proposed LCR is based on latent factors, several differences exist between LCR and previous latent factor models. First, instead of using the inner product, LCR searches for the top k relevant items based on the $\ell_2$ norm; it searches for items near the active user, and the search space is bounded by a circle. Second, LCR can transform the ranking problem into an approximate nearest neighbor problem, in which we use LSH to efficiently locate the candidates. Third, LCR provides recommendations with coverage owing to the characteristics of Euclidean space, as the experiments show.

The goal of the hashing used in LSH is to maximize the probability of collision of similar items, and the performance of LSH is associated with several parameters: the number of hash functions $M$, the number of hash tables $L$ and the window size $W$. LSH relies on hashing functions to reduce the candidate sizes. Given an item latent factor vector $w_i$, we insert it into hash tables, each of which is defined as follows [44]:
$$h_{a,b}(w_i) = \left\lfloor \frac{a^T w_i + b}{W} \right\rfloor$$
where $a$ is an $\omega$-dimensional random vector with entries chosen independently from a standard Gaussian distribution, which is 2-stable and works for the Euclidean distance, and $b$ is a real number chosen uniformly from the range $[0, W]$. The function $h_{a,b}$ projects the latent factor vector $w_i$ onto $a$ with an offset $b$, quantized by the window size $W$.

Figure 1: An illustration of LSH with latent factor vectors (L = 2, M = 2). The circles (item latent factor vectors) are hashed into their buckets independently, and the star (a query user latent factor vector) is hashed into bucket (1, 1) in the left hash table and (0, 1) in the right hash table.

Intuitively, each hashing function can be viewed as a partition over the space, so a larger value of $M$ leads to fewer latent factor vectors colliding in the same bucket, resulting in lower recall but shorter query time. To mitigate this, multiple hash tables (parameterized by $L$) are used; increasing $L$ enlarges the number of candidate neighbors, since more buckets are included across the tables. Figure 1 illustrates the idea of LSH with $M = 2$ and $L = 2$. The hash function on the left hand side hashes
the query data point to bucket (1, 1), and the points hashed to bucket (1, 1) are candidate neighbors. Similarly, we can use the hash function on the right hand side to find candidate neighbors. Using the two hash functions, the points with the orange, purple, and dark blue circles are candidates. The performance of LSH relies on its parameters, and the implementation of this work is based on the offline parameter tuning technique mentioned in [50] to find the optimal values of $M$, $W$ and $L$.

Given a user latent factor vector $w_u$, LSH can quickly locate its buckets and obtain the candidates by retrieving the items whose latent factor vectors are in these buckets. This work further uses multi-probe LSH [22] to intelligently probe multiple buckets that are likely to contain query results in a hash table, yielding a more efficient retrieval. The probing sequence for neighboring buckets is not chosen randomly but considers the distance to the originally hashed bucket. Taking Figure 1 as an example, the star point is hashed to bucket (0, 1) in the right hash table, and we can also probe bucket (0, 2), since its margin is closest to the query. The speedup of the retrieval is achieved because the number of items in these buckets is much smaller than the total number of items.
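To make the indexing concrete, here is a minimal single-probe sketch of the Euclidean (2-stable) LSH index built from $h_{a,b}$; the class name and defaults are our own assumptions, and a production system would add the multi-probe probing sequence of [22] and the parameter tuning of [50].

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """A sketch of an L-table, M-function LSH index over item latent vectors."""

    def __init__(self, omega, M=2, L=2, W=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W
        self.a = rng.standard_normal((L, M, omega))   # 2-stable projections
        self.b = rng.uniform(0.0, W, size=(L, M))     # offsets in [0, W)
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, w):
        # h_{a,b}(w) = floor((a^T w + b) / W); one M-tuple bucket key per table.
        return [tuple(np.floor((A @ w + B) / self.W).astype(int))
                for A, B in zip(self.a, self.b)]

    def insert(self, item_id, w_i):
        for table, key in zip(self.tables, self._keys(w_i)):
            table[key].append(item_id)

    def candidates(self, w_u):
        """Union of the buckets the user vector falls into; only these
        candidates are scored with f_d, which yields the speedup."""
        found = set()
        for table, key in zip(self.tables, self._keys(w_u)):
            found.update(table[key])
        return found
```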
4. Experiments
We conduct experiments on three data sets and compare the proposed algorithm with several algorithms. This work repeats each experiment ten times and reports the average and standard deviation of the results; the tables present the mean plus or minus two standard deviations. Each evaluation randomly selects 20% of the data for testing, and the remaining data for training. Some performance metrics, such as coverage, recall@k and selectivity, are computationally intensive, so the experiments randomly select 50 users in each testing set, denoted by $U_{test}$, to evaluate performance.

4.1. Data Sets

This work conducts experiments on three data sets, and focuses on two-class problems, in which the positive class implicitly indicates that a user likes a specific item. Note that the negative
class does not mean that the user dislikes the item. In most cases, the number of positive instances is overwhelmed by negative ones, so we use a sampling technique, in which the sampling probability for an item is proportional to its global popularity [51], to select partial negative samples to balance the two classes.
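A sketch of this popularity-proportional negative sampling follows; the helper name and data layout are assumptions on our part.

```python
import numpy as np

def sample_negatives(positives, item_popularity, n, seed=0):
    """Draw n (user, item) negatives, choosing items with probability
    proportional to global popularity [51] and skipping observed positives."""
    rng = np.random.default_rng(seed)
    items = list(item_popularity)
    probs = np.array([item_popularity[i] for i in items], dtype=float)
    probs /= probs.sum()
    users = [u for u, _ in positives]
    seen = set(positives)
    negatives = []
    while len(negatives) < n:
        u = users[rng.integers(len(users))]
        i = items[rng.choice(len(items), p=probs)]
        if (u, i) not in seen:
            negatives.append((u, i))
    return negatives
```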
• Last.fm Data Set: This data set is collected from the Last.fm (http://www.lastfm.com) online music system, and comprises the listening habits of nearly 2,000 users and 18,000 artists. The number of artist listening records is 93,000, and they are viewed as positive data. For the artists outside these records, we sample about 93,000 records as negative data. Besides user id and item id, this data set comprises 11,946 tag assignments of artists, which are also used in the experiments.
• Gowalla Data Set: Gowalla is a location-based social networking website where users share their locations by checking in. The total number of check-ins in the Gowalla data set [52] is about 6.3 million, covering 107,000 users and 1.3 million locations. We regard check-in logs as positive data, and the locations where visitors did not check in are viewed as negative data. As mentioned above, we sample approximately 6.3 million negative data for model training. Besides user id and item id, content features for each location, such as latitude and longitude, are used as item features in the experiments.
• Amazon Data Set: The Amazon data set was collected by crawling the Amazon website [53]. It is based on the "Customers Who Bought This Item Also Bought" feature of the Amazon website: if a product i is frequently co-purchased with a product j, the graph contains a directed edge from i to j. This data set focuses on item-item relationships, so we view product i as a user and product j as an item. Note that user ids are unavailable in this data set. This data set comprises around 335,000 products and 1 million co-purchasing records. We view the product pairs in the co-purchasing records as positive data. Conversely, the product pairs that do not appear in the co-purchasing records are viewed as negative data. Similarly, we sample 1 million negative data from the negative data collection.
4.2. Evaluation Metric

• Area under the curve (AUC): A receiver operating characteristics (ROC) curve is a plot used to illustrate the performance of a binary classifier system as its discrimination threshold is varied. In a ROC curve, the true positive rate is plotted as a function of the false positive rate for different cut-off points. To compare classifiers, one may want to reduce ROC performance to a single scalar value representing expected performance. The AUC provides
an alternative for evaluating classifier performance, with a value between 0 and 1. Larger AUC values indicate better classifier performance across the full range of possible thresholds. Many approaches have been proposed to calculate AUC, and we use the trapezoidal rule here. The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance, as shown in Equation (10), where $I$ is an indicator function. A more detailed introduction to ROC and AUC can be found in [54]. (A code sketch of these metrics appears after this list.)

$$\frac{1}{|X^{(+)}_{test}|\,|X^{(-)}_{test}|} \sum_{x^{(+)} \in X^{(+)}_{test}} \sum_{x^{(-)} \in X^{(-)}_{test}} I\!\left[f_d(x^{(+)}) > f_d(x^{(-)})\right] \tag{10}$$
• Coverage: Let $I$ denote the set of available items and $Rec@k(u)$ the set of top-k ranked items for a given user $u$. Given a test set $U_{test}$, a measurement of prediction coverage [9] is given by:

$$\frac{\left|\bigcup_{u \in U_{test}} Rec@k(u)\right|}{|I|}$$
• Recall@k: The proposed method recommends items approximately, so we use recall to evaluate the quality of the fast recommendation result. Let $u$ be a given user, $Rec@k(u)$ the set generated by the original recommendation, and $ApproxRec@k(u)$ the set generated by the approximate recommendation. Mean Recall@k over the entire test set $U_{test}$ is defined as:

$$\frac{1}{|U_{test}|} \sum_{u \in U_{test}} \frac{|Rec@k(u) \cap ApproxRec@k(u)|}{|Rec@k(u)|}$$
• Selectivity: The prediction of the proposed method requires scanning through the items in the candidate set and computing their predicted scores to obtain the top k items, so the prediction time is proportional to the size of the candidate set. Dong et al. [50] define selectivity as the size ratio between the candidate set $C(u)$ and the set of all items $I$, and report its mean over the entire test set $U_{test}$:

$$\frac{1}{|U_{test}|} \sum_{u \in U_{test}} \frac{|C(u)|}{|I|}$$
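The following sketches compute the four metrics above from precomputed scores and recommendation lists; all function names and input layouts are our own assumptions.

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """Equation (10): the fraction of (positive, negative) test pairs that
    the model orders correctly."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return (pos > neg).mean()

def coverage(rec_lists, num_items):
    """|union of Rec@k(u) over the test users| / |I|."""
    return len(set().union(*map(set, rec_lists))) / num_items

def mean_recall_at_k(exact_lists, approx_lists):
    """Mean overlap between the exact and LSH-approximate top-k lists."""
    return float(np.mean([len(set(e) & set(a)) / len(e)
                          for e, a in zip(exact_lists, approx_lists)]))

def mean_selectivity(candidate_sizes, num_items):
    """Mean |C(u)| / |I|: the fraction of items scored per query."""
    return float(np.mean([c / num_items for c in candidate_sizes]))

print(auc([0.9, 0.4], [0.1, 0.5]))  # 3 of 4 pairs ordered correctly -> 0.75
```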
Table 1: The Evaluation on three data sets with AUC

| Method       | Last.fm         | Gowalla         | Amazon          |
|--------------|-----------------|-----------------|-----------------|
| LCR          | 0.8419 ± 0.0023 | 0.9846 ± 0.0002 | 0.8102 ± 0.0009 |
| Random       | 0.5012 ± 0.0061 | 0.5001 ± 0.0007 | 0.4999 ± 0.0016 |
| Most Popular | 0.5018 ± 0.0019 | 0.5019 ± 0.0002 | 0.4349 ± 0.0076 |
| FM (SGD)     | 0.7818 ± 0.0098 | 0.9490 ± 0.0053 | 0.7765 ± 0.0170 |
| FM (ALS)     | 0.8366 ± 0.0026 | 0.9571 ± 0.0117 | 0.7240 ± 0.0027 |
| FM (NTU)     | 0.7876 ± 0.0042 | 0.9102 ± 0.0014 | 0.7852 ± 0.0015 |
| SVDFeature   | 0.7875 ± 0.0096 | 0.9595 ± 0.0003 | 0.5176 ± 0.0021 |
Figure 2: The Evaluation of Coverage on Last.fm
Figure 3: The Evaluation of Coverage on Gowalla
Figure 4: The Evaluation of Coverage on Amazon
4.3. Evaluation Results

This work compares six algorithms with the proposed algorithm. The first algorithm, "Random", randomly recommends items to the user. The second algorithm, "Most Popular", recommends popular items to the user. The 3rd-5th algorithms are variants of FM [55], which combines the generality of feature engineering with the strength of factorization models in estimating interactions between categorical variables of large domains. The libFM library (http://www.libfm.org/) is a software implementation of FM that features stochastic gradient descent (SGD) and alternating least-squares (ALS) optimization. We further reference the implementation used by Wu et al. [56], who modified the objective function of FM and won the first prize of KDD Cup 2012 Track 2, and abbreviate the modified FM as "FM (NTU)". The final one is SVDFeature [28], which has
won the KDD Cup for two consecutive years. This work sets the number of factors $\omega$ to ten in the experiments. The proposed method belongs to the latent factor models, and it can incorporate LSH to speed up item retrieval; the experiments in this section evaluate the proposed LCR without LSH.

The experimental results are presented in Table 1, which focuses on recommendation accuracy. The proposed method works well and outperforms the other alternatives. It is apparent that the "Random" algorithm fails to capture users' behaviors. Recommending popular items is a straightforward approach, but the experimental results indicate that this approach fails to perform well on the three data sets. The other algorithms are latent factor models. Compared to these latent factor models, the proposed method benefits from the proposed scoring function and optimization over a ranking loss with an $\ell_2$ regularization term, and yields a better performance. Note that SVDFeature fails to perform well on the Amazon data set, since this data set allows the same product to play different roles, including user and item. SVDFeature uses the conventional user-item relationship to design its model, and regards the same product in different roles as two kinds of objects, mapping them to two different latent factor vectors.

The second experiment focuses on recommendation coverage. Figures 2-4 show the experimental results, in which horizontal bar charts represent the coverage percentages of the top-10 and top-20 recommended items. Intuitively, "Random" should cover the largest item sets, while the item coverage of "Most Popular" should be small; the experimental results conform to this intuition. The recommendations of the proposed LCR cover more items than the other latent factor models. The experimental results indicate that calculating the distance in Euclidean space achieves a higher coverage than using inner products. An important property of Euclidean embedding is that users and items are embedded in a unified Euclidean space, and the interpretations of user points and item points are the same, since the maximum rating is reached when they are at the same point [49]. Therefore, the recommendation can be viewed as a nearest neighbor search problem. In a latent space, all the dimensions should comprise latent information; for example, in movie recommendation problems, the latent information could be interpreted as movie genres. Given a user data point, the item data points that are close to it should share some latent information with it, providing a basis for obtaining high coverage in the recommendation results. In contrast, the inner product between two vectors $q$ and $x$ involves the lengths of the two vectors and the angle between them. Thus, given a query data point $q$, one has to consider the length of the vector $x$ and the angle $\theta$ to determine the top k data points when using the inner product in recommendation problems. Item points with large lengths have a good chance of becoming recommendation candidates, so using the inner product as the similarity metric tends to recommend some specific items.

4.4. The Effect of Fast Recommendation

Besides recommendation quality, prediction time is always one of the most important issues in building a recommender system. If the recommender system fails to make recommendations in near real-time, the other quality improvements will not matter.
Table 2: The Evaluation on Fast Recommendation for Last.fm (|Utest| = 50)

| Recall | Top k | Recall@k        | Coverage (%)    | Speedup          | Selectivity (%)  |
|--------|-------|-----------------|-----------------|------------------|------------------|
| 0.7    | 10    | 0.7073 ± 0.0201 | 2.3915 ± 0.1982 | 9.9140 ± 2.5314  | 6.5263 ± 3.2646  |
| 0.7    | 20    | 0.6677 ± 0.0553 | 4.4427 ± 0.0365 | 8.8870 ± 2.4596  | 7.1176 ± 3.1835  |
| 0.8    | 10    | 0.8353 ± 0.0416 | 2.3064 ± 0.1573 | 6.4879 ± 2.4148  | 14.0341 ± 5.5439 |
| 0.8    | 20    | 0.8147 ± 0.0359 | 4.3463 ± 0.1929 | 5.4196 ± 1.9357  | 15.9811 ± 6.4921 |

Table 3: The Evaluation on Fast Recommendation for Gowalla (|Utest| = 50)

| Recall | Top k | Recall@k        | Coverage (%)    | Speedup              | Selectivity (%) |
|--------|-------|-----------------|-----------------|----------------------|-----------------|
| 0.7    | 10    | 0.7420 ± 0.1010 | 0.0362 ± 0.0017 | 313.8905 ± 56.3864   | 0.0602 ± 0.0123 |
| 0.7    | 20    | 0.7007 ± 0.1201 | 0.0705 ± 0.0019 | 286.1822 ± 15.7925   | 0.0654 ± 0.0119 |
| 0.8    | 10    | 0.8787 ± 0.0726 | 0.0390 ± 0.0001 | 280.9170 ± 165.6483  | 0.2624 ± 0.2213 |
| 0.8    | 20    | 0.8480 ± 0.0681 | 0.0768 ± 0.0017 | 265.6217 ± 145.5929  | 0.2732 ± 0.2205 |

Table 4: The Evaluation on Fast Recommendation for Amazon (|Utest| = 50)

| Recall | Top k | Recall@k        | Coverage (%)    | Speedup           | Selectivity (%) |
|--------|-------|-----------------|-----------------|-------------------|-----------------|
| 0.7    | 10    | 0.7280 ± 0.0418 | 0.1492 ± 0.0003 | 52.0737 ± 6.1795  | 0.4999 ± 0.3121 |
| 0.7    | 20    | 0.6760 ± 0.0727 | 0.2983 ± 0.0010 | 47.0362 ± 6.0529  | 0.5253 ± 0.2981 |
| 0.8    | 10    | 0.8313 ± 0.0569 | 0.1575 ± 0.0000 | 26.1614 ± 13.2057 | 3.6175 ± 1.6453 |
| 0.8    | 20    | 0.7933 ± 0.0479 | 0.3153 ± 0.0004 | 25.5232 ± 13.8326 | 3.6534 ± 1.6813 |

Table 5: The Evaluation on Fast Recommendation for Last.fm (|Utest| = 100)

| Recall | Top k | Recall@k        | Coverage (%)    | Speedup           | Selectivity (%)  |
|--------|-------|-----------------|-----------------|-------------------|------------------|
| 0.7    | 10    | 0.6686 ± 0.0561 | 3.7128 ± 0.1163 | 24.0299 ± 0.7505  | 4.6988 ± 2.6224  |
| 0.7    | 20    | 0.6283 ± 0.0689 | 6.5689 ± 0.4600 | 23.5134 ± 0.9238  | 5.4453 ± 2.9550  |
| 0.8    | 10    | 0.7969 ± 0.0536 | 3.7656 ± 0.1869 | 22.3791 ± 2.2980  | 13.2499 ± 4.6979 |
| 0.8    | 20    | 0.7907 ± 0.0449 | 6.6056 ± 0.2489 | 21.4637 ± 2.0386  | 14.9935 ± 4.5849 |

Table 6: The Evaluation on Fast Recommendation for Amazon (|Utest| = 100)

| Recall | Top k | Recall@k        | Coverage (%)    | Speedup             | Selectivity (%) |
|--------|-------|-----------------|-----------------|---------------------|-----------------|
| 0.7    | 10    | 0.6861 ± 0.0540 | 0.2818 ± 0.0107 | 356.5580 ± 37.0615  | 0.3162 ± 0.0422 |
| 0.7    | 20    | 0.6451 ± 0.0664 | 0.5496 ± 0.0112 | 346.6763 ± 52.5300  | 0.3543 ± 0.0315 |
| 0.8    | 10    | 0.8046 ± 0.0415 | 0.2856 ± 0.0133 | 344.4391 ± 92.7600  | 2.2513 ± 0.0952 |
| 0.8    | 20    | 0.7707 ± 0.0566 | 0.5558 ± 0.0319 | 343.6052 ± 85.9118  | 2.3393 ± 0.1065 |
Figure 5: The Evaluation of Content-features on Last.fm
Figure 6: The Evaluation of Content-features on Gowalla
This work considers the strengths of latent factors and transforms the recommendation problem into a k-nearest-neighbor search problem in Euclidean space. We further use the multi-probe LSH technique to accelerate nearest neighbor search. We conduct experiments to investigate the effects of the proposed model under two schemes: the first scheme is multi-probe LSH, while the second is brute-force search. A high value of recall@k implies that the approximate recommendations are very similar to the original recommendations. Recall and retrieval time are a trade-off in approximate search. Moreover, a significant drawback of LSH is that its performance is very sensitive to its parameters, and tuning the parameters for a given data set remains a tedious process. Dong et al. [50] proposed a performance model for multi-probe LSH: given a small sample data set, the model can accurately predict the average search quality and latency. Note that the performance model requires a target value for recall; the experiments set the recall targets to 0.7 and 0.8, respectively.

The experimental results are presented in Tables 2-4, which summarize the results of top-10 and top-20 recommendation. This work carries out each experiment ten times, but all experiments share the same parameters, which are obtained by applying the performance model to the first experiment. Thus, each Recall@k in Tables 2-4 approaches the required value rather than always exceeding it. Additionally, the experimental results indicate that a higher recall leads to a lower speedup, conforming to the design of approximate search algorithms. The experimental results indicate that the proposed model incorporating multi-probe LSH can speed up retrieval, being 5 to 313 times faster than the baseline. Furthermore, the selectivity indicator explains why the speedup can be so significant: only a small portion of the data is searched to obtain the top k items. As the data set becomes larger, the selectivity becomes much lower, since LSH filters out most of the items. As a consequence, the speedup on a large-scale data set, such as Gowalla, can be significant. Note that the coverage of the fast recommendation is almost the same as that of the brute-force approach shown in Figures 2-4. Furthermore, we conduct experiments on the Last.fm and Amazon data sets with $|U_{test}| = 100$, and the experimental results are presented in Table 5 and Table 6. The results resemble those of the previous experiments, in which the proposed model incorporating multi-probe LSH speeds up the retrieval
performance.

4.5. The Effect of Content-features

In the collaborative filtering setting, the introduction of new users and new items leads to the cold-start problem, since there are insufficient data for the model to work accurately. Conversely, content-based algorithms are based on item information and user profiles, so they can use this available content to tackle cold-start problems. In most e-commerce websites, content information, such as basic user profiles and item descriptions, is always available. As mentioned above, the proposed method provides the flexibility to incorporate various features into the model to enhance prediction quality and prevent cold-start problems. This work conducts two experiments on the Last.fm and Gowalla data sets to evaluate the effect of content features. Note that the Amazon data set does not provide additional content features, so it is absent from these experiments. Figure 5 and Figure 6 show the experimental results, which indicate that the proposed method can benefit from additional information. Besides, incorporating additional content features into the proposed model does not affect the coverage results significantly.

5. Conclusion and Further Study
This work considers recommendation accuracy, coverage and prediction time issues to devise a latent factor model algorithm. The main ideas behind the proposed algorithm are to transform the recommendation problem into a nearest neighbor search problem and to provide the flexibility to incorporate LSH into the model to speed up the search for the top k items. The proposed model combines elegantly with LSH to provide fast recommendation while retaining recommendation accuracy and coverage. Additionally, the proposed method considers practical system deployment issues by using a stochastic gradient descent algorithm along with a sampling technique to optimize the ranking loss. Future work is to incorporate fast recommendation into the model and to tune the LSH parameters systematically using a deeper theoretical analysis.
Acknowledgment
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant No. MOST 104-2218-E-009-028.

References
[1] Z. Zhang, H. Lin, K. Liu, D. Wu, G. Zhang, J. Lu, A hybrid fuzzy-based personalized recommender system for telecom products/services, Inf. Sci. 235 (2013) 117–129.
[2] J. Lu, Q. Shambour, Y. Xu, Q. Lin, G. Zhang, BizSeeker: A hybrid semantic recommendation system for personalized government-to-business e-services, Internet Research 20 (3) (2010) 342–365.
[3] Q. Shambour, J. Lu, A hybrid trust-enhanced collaborative filtering recommendation approach for personalized government-to-business e-services, Int. J. Intell. Syst. 26 (9) (2011) 814–843.
[4] G. Linden, B. Smith, J. York, Amazon.com recommendations: Item-to-item collaborative filtering, IEEE Internet Computing 7 (1) (2003) 76–80.
[5] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, ACM, New York, NY, USA, 2008, pp. 426–434.
[6] H. Wang, W.-J. Li, Relational collaborative topic regression for recommender systems, Knowledge and Data Engineering, IEEE Transactions on 27 (5) (2015) 1343–1355.
[7] M. Sarwat, J. J. Levandoski, A. Eldawy, M. F. Mokbel, LARS*: An efficient and scalable location-aware recommender system, Knowledge and Data Engineering, IEEE Transactions on 26 (6) (2014) 1384–1399.
[8] J. L. Herlocker, J. A. Konstan, L. G. Terveen, J. T. Riedl, Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems (TOIS) 22 (1) (2004) 5–53.
[9] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender systems by coverage and serendipity, in: Proceedings of the Fourth ACM Conference on Recommender Systems, ACM, 2010, pp. 257–260.
[10] D. M. Fleder, K. Hosanagar, Recommender systems and their impact on sales diversity, in: Proceedings of the 8th ACM Conference on Electronic Commerce, EC '07, ACM, New York, NY, USA, 2007, pp. 192–199.
[11] M. Zhang, Enhancing diversity in top-n recommendation, in: Proceedings of the Third ACM Conference on Recommender Systems, RecSys '09, ACM, New York, NY, USA, 2009, pp. 397–400.
[12] K. Niemann, M. Wolpers, A new collaborative filtering approach for increasing the aggregate diversity of recommender systems, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, ACM, New York, NY, USA, 2013, pp. 955–963.
[13] C. Anderson, The Long Tail: Why the Future of Business Is Selling Less of More, Hyperion, 2006.
[14] Y.-J. Park, The adaptive clustering method for the long tail problem of recommender systems, Knowledge and Data Engineering, IEEE Transactions on 25 (8) (2013) 1904–1915.
[15] K. Yoshii, M. Goto, K. Komatani, T. Ogata, H. G. Okuno, An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model, IEEE Transactions on Audio, Speech & Language Processing 16 (2) (2008) 435–447.
[16] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (8) (2009) 30–37.
[17] S. Rendle, Factorization machines, in: Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 995–1000. doi:10.1109/ICDM.2010.127.
[18] S. Rendle, Z. Gantner, C. Freudenthaler, L. Schmidt-Thieme, Fast context-aware recommendations with factorization machines, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, ACM, New York, NY, USA, 2011, pp. 635–644. doi:10.1145/2009916.2010002.
[19] Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, U. Paquet, Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces, in: Proceedings of the 8th ACM Conference on Recommender Systems, RecSys '14, ACM, New York, NY, USA, 2014, pp. 257–264.
[20] R. F. Sproull, Refinements to nearest-neighbor searching in k-dimensional trees, Algorithmica 6 (1-6) (1991) 579–589.
[21] N. Koenigstein, P. Ram, Y. Shavitt, Efficient retrieval of recommendations in a matrix factorization framework, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, ACM, New York, NY, USA, 2012, pp. 535–544.
[22] Q. Lv, W. Josephson, Z. Wang, M. Charikar, K. Li, Multi-probe LSH: Efficient indexing for high-dimensional similarity search, in: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, 2007, pp. 950–961.
[23] P. Indyk, R. Motwani, Approximate nearest neighbors: Towards removing the curse of dimensionality, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, ACM, New York, NY, USA, 1998, pp. 604–613. doi:10.1145/276698.276876.
[24] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005) 734–749.
[25] J. L. Herlocker, J. A. Konstan, A. Borchers, J. Riedl, An algorithmic framework for performing collaborative filtering, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, ACM, New York, NY, USA, 1999, pp. 230–237.
[26] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th International Conference on World Wide Web, WWW ’01, ACM, New York, NY, USA, 2001, pp. 285–295.
[27] M. Deshpande, G. Karypis, Item-based top-n recommendation algorithms, ACM Trans. Inf. Syst. 22 (1) (2004) 143–177.
[28] T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, Y. Yu, SVDFeature: A toolkit for feature-based collaborative filtering, Journal of Machine Learning Research 13 (2012) 3585–3588.
[29] S. Balakrishnan, S. Chopra, Collaborative ranking, in: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM ’12, ACM, New York, NY, USA, 2012, pp. 143–152.
[30] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, AUAI Press, Arlington, Virginia, United States, 2009, pp. 452–461.
[31] Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, A. Hanjalic, N. Oliver, TFMAP: Optimizing MAP for top-n context-aware recommendation, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, ACM, New York, NY, USA, 2012, pp. 155–164.
[32] Z. Gantner, L. Drumond, C. Freudenthaler, L. Schmidt-Thieme, Bayesian personalized ranking for non-uniformly sampled items, JMLR W&CP, Jan.
[33] J. Liu, C. Wu, Y. Xiong, W. Liu, List-wise probabilistic matrix factorization for recommendation, Information Sciences 278 (2014) 434–447.
[34] A. P. Singh, G. J. Gordon, Relational learning via collective matrix factorization, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, ACM, New York, NY, USA, 2008, pp. 650–658.
[35] X. Qian, H. Feng, G. Zhao, T. Mei, Personalized recommendation combining user interest and social circle, IEEE Transactions on Knowledge and Data Engineering 26 (7) (2014) 1763–1777.
[36] M. Jiang, P. Cui, F. Wang, W. Zhu, S. Yang, Scalable recommendation with social contextual information, IEEE Transactions on Knowledge and Data Engineering 26 (11) (2014) 2789–2802.
[37] J. L. Bentley, Multidimensional binary search trees used for associative searching, Commun. ACM 18 (9) (1975) 509–517.
[38] A. Guttman, R-trees: A dynamic index structure for spatial searching, in: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, SIGMOD ’84, ACM, New York, NY, USA, 1984, pp. 47–57.
[39] A. Beygelzimer, S. Kakade, J. Langford, Cover trees for nearest neighbor, in: Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, ACM, New York, NY, USA, 2006, pp. 97–104.
[40] C. Böhm, S. Berchtold, D. A. Keim, Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases, ACM Comput. Surv. 33 (3) (2001) 322–373.
[41] E. Kushilevitz, R. Ostrovsky, Y. Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, ACM, New York, NY, USA, 1998, pp. 614–623.
[42] R. Krauthgamer, J. R. Lee, Navigating nets: Simple algorithms for proximity search, in: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’04, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004, pp. 798–807.
[43] A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Commun. ACM 51 (1) (2008) 117–122.
[44] M. Datar, N. Immorlica, P. Indyk, V. S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in: Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG ’04, ACM, New York, NY, USA, 2004, pp. 253–262.
[45] P. Indyk, Stable distributions, pseudorandom generators, embeddings, and data stream computation, J. ACM 53 (3) (2006) 307–323.
[46] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: D. Koller, D. Schuurmans, Y. Bengio, L. Bottou (Eds.), Advances in Neural Information Processing Systems 21, Curran Associates, Inc., 2009, pp. 1753–1760.
[47] M. Bronstein, A. Bronstein, F. Michel, N. Paragios, Data fusion through cross-modality metric learning using similarity-sensitive hashing, in: Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’10, 2010, pp. 3594–3601.
[48] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, in: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’14, IEEE Computer Society, Washington, DC, USA, 2014, pp. 2083–2090.
[49] M. Khoshneshin, W. N. Street, Collaborative filtering via Euclidean embedding, in: Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, ACM, New York, NY, USA, 2010, pp. 87–94.
[50] W. Dong, Z. Wang, W. Josephson, M. Charikar, K. Li, Modeling LSH for performance tuning, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, ACM, 2008, pp. 669–678.
[51] Z. Gantner, L. Drumond, C. Freudenthaler, L. Schmidt-Thieme, Personalized ranking for non-uniformly sampled items, in: G. Dror, Y. Koren, M. Weimer (Eds.), KDD Cup, Vol. 18 of JMLR Proceedings, JMLR.org, 2012, pp. 231–247.
[52] E. Cho, S. A. Myers, J. Leskovec, Friendship and mobility: User movement in location-based social networks, in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 1082–1090.
[53] J. Yang, J. Leskovec, Defining and evaluating network communities based on ground-truth, in: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, ACM, 2012, p. 3.
[54] T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett. 27 (8) (2006) 861–874.
[55] S. Rendle, Factorization machines with libFM, ACM Trans. Intell. Syst. Technol. 3 (3) (2012) 57:1–57:22.
[56] K.-W. Wu, C.-S. Ferng, C.-H. Ho, A.-C. Liang, C.-H. Huang, W.-Y. Shen, J.-Y. Jiang, M.-H. Yang, T.-W. Lin, C.-P. Lee, P.-H. Kung, C.-E. Wang, T.-W. Ku, C.-Y. Ho, Y.-S. Tai, I.-K. Chen, W.-L. Huang, C.-P. Chou, T.-J. Lin, H.-J. Yang, Y.-K. Wang, C.-T. Li, S.-D. Lin, H.-T. Lin, A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012, Tech. rep., National Taiwan University, first-place winner report of KDD Cup 2012 track 2 (August 2012).