Accepted Manuscript

APRA: An Approximate Parallel Recommendation Algorithm for Big Data

Badr Ait Hammou, Ayoub Ait Lahcen, Salma Mouline

PII: S0950-7051(18)30216-8
DOI: 10.1016/j.knosys.2018.05.006
Reference: KNOSYS 4325

To appear in: Knowledge-Based Systems

Received date: 16 December 2017
Revised date: 25 March 2018
Accepted date: 5 May 2018

Please cite this article as: Badr Ait Hammou, Ayoub Ait Lahcen, Salma Mouline, APRA: An Approximate Parallel Recommendation Algorithm for Big Data, Knowledge-Based Systems (2018), doi: 10.1016/j.knosys.2018.05.006
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
APRA: An Approximate Parallel Recommendation Algorithm for Big Data
Badr Ait Hammou (a,*), Ayoub Ait Lahcen (a,b), Salma Mouline (a)

(a) LRIT, Associated Unit to CNRST (URAC 29), Rabat IT Center, Faculty of Sciences, Mohammed V University, Rabat, Morocco
(b) LGS, National School of Applied Sciences (ENSA), Ibn Tofail University, Kenitra, Morocco
Abstract
Finding relevant and interesting items according to the preferences of each user has become an important challenge in the era of Big Data. Recommender systems have emerged in response to this problem. Collaborative Filtering (CF) is one of the most successful recommendation techniques and is used by several big online shopping companies. However, CF is computationally demanding, especially in the Big Data context, where the numbers of users and items are too large to be processed effectively by traditional approaches.

In this paper, we propose a new solution based on Spark, which is tailored to handle large-scale data and provide better results. In particular, we take advantage of the in-memory operations available through Spark to improve the performance of recommender systems in the context of Big Data. Experimental results on two real-world data sets confirm the claim.

Keywords: Recommender system, Big Data, Apache Hadoop, Apache Spark, Collaborative filtering
* Corresponding author.
Email addresses: [email protected] (Badr Ait Hammou), [email protected] (Ayoub Ait Lahcen), [email protected] (Salma Mouline)
Preprint submitted to Journal of LaTeX Templates, May 16, 2018
1. Introduction

Over the last few years, Big Data has become an important issue for a large number of research fields. The main challenge of Big Data lies in exploring the huge amount of data and extracting useful information for specific purposes [1, 2, 3, 4].

Currently, multiple websites provide millions of items to their customers. However, finding relevant and interesting items can be challenging due to the tremendous growth of the information available.

Recommender systems (RSs) have emerged in response to this problem. A recommender system is an information filtering system, which is able to filter large amounts of information and generate recommendations that will most likely fit the user's needs [5].

Collaborative filtering (CF) represents one of the most successful recommendation techniques, used by several online shopping companies such as Netflix, eBay and Amazon [6]. The basic idea of collaborative filtering methods is to make predictions about user preferences based upon the preferences of a group of users who are considered similar to the active user [7].

As a drawback, collaborative filtering approaches usually perform their computations offline. Thus, they are not able to handle dynamic problems efficiently; i.e., they do not take into account new information such as new ratings, and the information that appears between two successive offline computations is not considered. Consequently, applying these techniques to real-world applications is a non-trivial task, especially in the context of Big Data.

In addition, memory-based methods have another major disadvantage, related to the calculation of similarities. The similarity between two users is often computed only once to generate the predictions; it is then treated as fixed and unchangeable, which is not always the case [8, 9].
The following points summarize some challenges in designing CF methods:

• Sparsity: The underlying rating matrices are sparse, i.e., most users have viewed only a small fraction of the large number of available items. As a result, most of the ratings are unknown [10].

• Dynamic data: Data are constantly changing. Thus, the recommender system requires an algorithm that updates the results quickly and accurately [8].

• Computation time: The time required for performing computational tasks rises steeply as the number of users and items increases [11].

Some solutions that provide accurate predictions have been proposed, but they usually incur prohibitive computational costs, especially when applied to large data sets. This lack of efficiency limits the application of these solutions at the current rates of data production growth [12].
1.1. Contributions

In summary, the contributions of this work are as follows:

• We extend the recommendation algorithm proposed in [13] to enable it to handle large-scale data effectively.

• We improve the performance of the algorithm based on new designs that fit newly emerging technologies.

• We adopt the personalized behavior concept for each user together with random sampling.

• We define the personalized parameters and the direct estimation to characterize the personalized behavior of each user.

• Every single operation is performed within the RDDs provided by Spark, which implies that the operations are efficient and fully scalable.
1.2. Structure

The rest of this paper is organized as follows. Section 2 presents the related works. Section 3 introduces the Big Data technologies used in this work. Section 4 details the proposed method. Section 5 describes our experimentations. Section 6 presents the discussion. Finally, Section 7 concludes this paper.
2. Related works
Recommender systems have attracted much attention in the past decade, due to the fact that they have been adopted by many real applications [11]. Recommendation methods can be classified into two main categories: content-based and collaborative filtering approaches [14].

The content-based approach attempts to predict how users will rate a set of items based on their personal information and the features of the items that they liked in the past [15, 11]. Meanwhile, the second category refers to collaborative filtering methods, which include model-based and memory-based CF.

Model-based CF constructs a model that learns from the rating matrix in order to capture hidden features of users and items, and then predicts the missing ratings according to this model [11, 16].
Due to the sparsity problem, several works have been proposed. In order to tackle this problem, the idea of factoring the matrix is well known in the field of machine learning. In [17], the authors put forward an algorithm based on approximating the singular value decomposition (SVD), called ApproSVD. This technique is based on extracting some rows of a rating matrix and scaling each row by an appropriate factor in order to form a smaller matrix; ApproSVD then applies the dimensionality reduction technique SVD to produce a good approximation of the rating matrix [17]. Ghavipour et al. [18] proposed a continuous action-set learning automata (CALA)-based method to alleviate the data sparsity and cold start problems. Zhou et al. [8] developed an incremental algorithm based on SVD, called the Incremental ApproSVD, which combines the Incremental SVD algorithm with the ApproSVD algorithm. Luo et al. [19] developed a hierarchical Bayesian model, which is based on the assumption that the ratings and the latent features follow a Gaussian-Gamma distribution.
In the literature, memory-based algorithms, also known as nearest neighbor methods, comprise user-based and item-based approaches [20]. They estimate ratings based on the relationships between users and between items, respectively. In other words, they treat the user ratings with statistical techniques in order to find users with similar preferences (i.e., neighbors). The prediction of preferences for the active user is then based on the features of this neighborhood. Several similarity functions can be applied, although the most popular is the Pearson correlation coefficient. Once the most similar items are found, the prediction is computed [14]. Hu et al. [3] designed a clustering-based collaborative filtering approach denoted ClubCF; this method aims at grouping similar services in the same clusters in order to recommend services collaboratively. Ar et al. [21] proposed an algorithm based on user-based CF and a genetic algorithm for refining the user-to-user similarity values before the prediction step. Chen et al. [22] adopted heterogeneous evolutionary clustering for clustering the data; the rating prediction is then performed in each cluster according to user-based collaborative filtering.
On the other hand, context-aware recommender systems, which integrate contextual information, have become one of the popular recommendation approaches. For instance, Liu et al. [23] proposed a social network aided context-aware recommender system (SoCo), which incorporates contextual information and social network information to improve the recommendation quality. The idea behind the proposed approach is to partition the user-item rating matrix using random decision trees, and then to employ the influence of users' friends and dimensionality reduction to infer the missing preferences using the generated sub-matrices. Macedo et al. [24] focused on events available in social networks and developed a context-aware event recommendation approach, which exploits the contextual signals available in event-based social networks (EBSNs). Zhang et al. [25] designed a group-based event participation prediction framework, which uses personalized random walk with restart in the network to calculate the event-specific attraction scores for users, and fuses together group-based social factors.
In order to address the main issues and challenges related to Big Data and recommender systems, several recent approaches have been developed for various purposes. For example, Zhang et al. [26] designed a so-called Covering Algorithm based on Quotient space Granularity analysis on Spark (CA-QGS) to specifically recommend web services in a Big Data scenario. Lee et al. [27] observed that many real-world applications require both batch processing and stream processing, so they employed the lambda architecture to design a restaurant recommender system over Apache Mesos. This architecture consists of three layers: first, the batch layer, for precomputing results using a framework such as Apache Hadoop; second, the speed layer, for real-time processing using a distributed stream processing engine such as Apache Spark; and third, the serving layer, for storing the output from the batch and speed layers, as well as for responding to client queries. Chen et al. [28] proposed an algorithm called disease diagnosis and treatment recommendation system (DDTRS) for Big Data, which is intended to suggest medical treatments based on the patients' inspection reports. Saravanan et al. [29] compared the performance of content-based recommendation in Apache Hadoop and Spark, and demonstrated that a system based on Apache Spark can provide fast recommendations. Kupisz et al. [30] proposed a solution based on Spark and Hadoop, which is able to accelerate parallel item-based collaborative filtering using the Tanimoto coefficient. The solution was compared with a scalable algorithm implemented using Apache Mahout and Hadoop.
To address the cold-start problem in the context of Big Data, Hsieh et al. [31] designed a keyword-aware recommender system, which exploits the textual descriptions of users and items using Apache Hadoop. Meanwhile, Panigrahi et al. [32] focused on the sparsity, scalability and cold start problems, and presented a hybrid distributed recommender system, which combines Alternating Least Squares, the k-means algorithm, and the relationships between users, items and tags.
3. Preliminaries
This section covers the necessary background for understanding the rest of the paper, including the notation, the MapReduce programming model, and the frameworks Hadoop and Spark.
3.1. Notation

In this paper, we deal with the recommendation problem in the context of Big Data. Given a set of M users U = {u1, u2, ..., uM} and a set of N items I = {i1, i2, ..., iN}, the preferences of each user u are represented as a rating vector Ru = (ru,i1, ru,i2, ..., ru,iN), where ru,i ∈ [rmin, rmax]. The goal is to predict the user's likely scores for the unseen items, based on the historical preferences.

Table 1 provides a comprehensive enumeration of the different symbols used in this paper.
Table 1: Symbol denotation.

Symbol             Description
U                  the set of users
I                  the set of items
M                  the number of users
N                  the number of items
Ru                 the set of ratings expressed by the user u
Ri                 the set of ratings expressed for the item i
ru,i               the user u's rating on item i
r̂u,i               the user u's predicted rating on item i
D = [rmin, rmax]   the rating domain of the dataset
|.|                the number of elements in the set
k                  the number of elements of the global and personalized behavior
3.2. MapReduce and frameworks

This subsection introduces the MapReduce programming model and the frameworks Hadoop and Spark.

3.2.1. MapReduce programming model
MapReduce is a powerful programming paradigm, which has become very popular in recent years. This model was designed by Google [33] for processing large-scale data. Its main strength is to enable the automatic parallelization and distribution of computations. Hence, it allows programmers inexperienced with parallel and distributed systems to efficiently utilize the resources of a large distributed system [34, 33].
172
MapReduce is based on two user-defined functions: Map and Reduce. The
174
original data are split into independent chunks, which are processed by the
175
map functions in parallel. The Map function takes the input data in a form
176
of < key, value > pairs, and transforms them into a set of intermediate <
177
key, value > pairs. MapReduce combines all the pairs associated with the same
178
intermediate key. After that, the Reduce function takes the grouped output
179
from the maps, and translates them into another < key, value > pairs.
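As an illustration, the classic word-count computation can be mimicked in plain, sequential Python. This is only a sketch of the Map/Reduce contract described above (the helper names are ours, not part of Hadoop or any framework):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Map: emit an intermediate <word, 1> pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: sum all counts grouped under the same intermediate key.
    return (word, sum(counts))

def run_mapreduce(records):
    # Apply Map to every <key, value> input record.
    intermediate = [pair for key, val in records for pair in map_fn(key, val)]
    intermediate.sort(key=itemgetter(0))  # the "shuffle": group by key
    return [reduce_fn(k, [c for _, c in grp])
            for k, grp in groupby(intermediate, key=itemgetter(0))]

print(run_mapreduce([(0, "big data"), (1, "big ideas")]))
# → [('big', 2), ('data', 1), ('ideas', 1)]
```

In a real deployment, the map calls run in parallel over the data chunks and the framework performs the grouping; the sequential loop above only illustrates the data flow.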
3.2.2. Frameworks: Apache Hadoop and Spark
Apache Hadoop is the popular open source implementation of the MapReduce programming paradigm, tailored for processing large-scale data sets in a distributed computing environment. HDFS (Hadoop Distributed File System) is the distributed file system component of Hadoop, which is designed for storing large data sets reliably and for providing highly fault-tolerant storage [35, 36].

Apache Spark is an in-memory cluster computing framework for large-scale data processing. It has emerged as a powerful successor to Hadoop, primarily due to its superior capabilities for richer and faster analysis. Spark can run programs up to 100x faster than Hadoop when data fit in memory, or 10x faster on disk [37].
Spark revolves around the concept of the Resilient Distributed Dataset (RDD), Spark's fundamental abstraction. An RDD is a read-only collection of partitioned data, which offers efficient data reuse in a wide range of applications. An RDD is a fault-tolerant data structure that can be rebuilt easily if a partition is lost. It offers other advantages as well, including optimized data placement on a distributed system and manipulation through a set of operators [38, 39].
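As a loose, plain-Python analogy for the RDD idea (our own toy class, not Spark's API), a dataset can be represented by its source plus a recorded chain of transformations, so that a lost partition could be rebuilt by replaying this lineage:

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable lineage, lazy evaluation."""
    def __init__(self, source, ops=()):
        self.source = source  # the original data (e.g., lines read from HDFS)
        self.ops = ops        # recorded transformations (the lineage)

    def map(self, fn):
        # Transformations only extend the lineage; nothing is computed yet.
        return MiniRDD(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self.source, self.ops + (("filter", pred),))

    def collect(self):
        # Action: replay the whole lineage over the source data.
        data = list(self.source)
        for kind, fn in self.ops:
            data = [fn(x) for x in data] if kind == "map" else [x for x in data if fn(x)]
        return data

rdd = MiniRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # → [20, 30, 40]
```

Transformations such as map and filter are lazy: they only extend the lineage, and the work happens when an action such as collect is invoked, which mirrors how Spark schedules RDD computations.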
4. Proposed method for recommender system in Big Data
In this section, we present the two proposed strategies using Spark for a recommender system in the context of Big Data. We focus on improving the prediction accuracy and reducing the runtime. Figs. 1-2 depict the general flowcharts of the two approaches.

[Figure 1: The exact approach. Flowchart: HDFS training set → aggregating ratings per user and per item → parallel and distributed training → prediction step → HDFS output.]

[Figure 2: The approximate approach based on random sampling. Flowchart: HDFS training set → random sampling → aggregating ratings per user and per item → parallel and distributed training → prediction step → HDFS output.]
The exact and the approximate strategies have the following steps in common:

• The collected ratings are aggregated per user.

• The set of preferences is grouped by item.

• The preferences on items and users are used to construct the model.

• The built model is used to predict future user preferences.

Furthermore, in the approximate strategy, random sampling is adopted, i.e., the original training set is sampled randomly. Then the same steps as in the exact strategy are performed.
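The two aggregation steps can be sketched in plain Python as a stand-in for Spark's grouping operators (the function name is ours):

```python
from collections import defaultdict

def aggregate_ratings(triples):
    """Group (user, item, rating) triples per user and per item."""
    by_user, by_item = defaultdict(list), defaultdict(list)
    for user, item, rating in triples:
        by_user[user].append(rating)  # ratings expressed by each user (Ru)
        by_item[item].append(rating)  # ratings expressed for each item (Ri)
    return dict(by_user), dict(by_item)

ratings = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0)]
by_user, by_item = aggregate_ratings(ratings)
print(by_user["u1"])  # → [4.0, 3.0]
print(by_item["i1"])  # → [4.0, 5.0]
```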
4.1. Random sampling

Consider a data set that consists of M users and N items. The computational time required for processing the data depends on the number of users, items and ratings. In the context of Big Data, as the numbers of users, items and ratings become very large, the computational cost also becomes very expensive, which affects the efficiency of recommender systems.

In order to address this shortcoming, let |Ru| be the number of preferences expressed by the user u, and let φ be the random sampling parameter. The main idea behind this step is to randomly select φ|Ru| of the ratings of each user u ∈ U, where φ|Ru| < |Ru|. Thus only a fraction φ of the original training data is kept and used for training the algorithm, rather than the full training set.
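A minimal sketch of this per-user sampling in plain Python (the paper's implementation is in Scala on Spark; the keep-at-least-one-rating guard is our assumption):

```python
import random

def sample_per_user(by_user, phi, seed=0):
    """Randomly keep a fraction phi of each user's ratings (the paper's φ|Ru|)."""
    rng = random.Random(seed)
    sampled = {}
    for user, ratings in by_user.items():
        keep = max(1, int(phi * len(ratings)))  # assumption: keep at least one rating
        sampled[user] = rng.sample(ratings, keep)
    return sampled

train = {"u1": [4.0, 3.0, 5.0, 2.0, 4.0], "u2": [1.0, 2.0]}
small = sample_per_user(train, phi=0.4)
print({u: len(r) for u, r in small.items()})  # → {'u1': 2, 'u2': 1}
```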
4.2. A parallel distributed training

The model consists of units organized into groups. These groups are arranged in a chain structure, with each unit being a function of the unit that preceded it. The directed acyclic graph depicted in Fig. 3 clarifies the overall structure of the model, particularly how many units should be adopted in each group and how these units are composed together.

Consider the observed ratings Ru and Ri for a user u and an item i, respectively. In this structure, the first goal is to take a vector X ∈ R^|X| representing either Ru or Ri as input, and to transform it into Pu ∈ R^|D| or Pi ∈ R^|D| as output:

f(X) = (f(rmin)(X), ..., f(rmax)(X))    (1)
[Figure 3: The training step.]
where f(r)(X) is the probability of expressing the rating r as a preference with respect to X. The key consideration of this function is to produce a low-dimensional representation of the input data.
Afterwards, it is essential to model the personalized behavior of each user u. For a personal representation, each user is mapped as follows:

Wu = g(Pu) = (wu(1), ..., wu(k))    (2)

where

wu(j) = max(Pu) if j = 1,   wu(j) = max(PPu(j−1)) if 2 ≤ j ≤ k    (3)

The function g(Pu) is devoted to capturing the most relevant aspects for a single user u. The vector Wu ∈ R^k allows identifying the type of user behavior, such as optimistic or pessimistic behavior. PPu(j) represents all the elements of the vector PPu(j−1) except the element wu(j), and k is the vector dimension.
In the case of items, the goal is to generalize the explicit ratings by users regarding their interest in the items. So each item i is defined as follows:

Wi = hi(Pi) = (wi(1), ..., wi(k))    (4)

where

wi(j) = f(b(j))(Ri),   B = (b(1), ..., b(k))    (5)

where the probability of expressing the rating b(1) in the system is higher than the probability of expressing b(2), and so on. The vector Wi ∈ R^k approximates the opinions of most users about an item i. The basic idea behind the function hi(Pi) is to build the item representation by considering global aspects, through analysing all the previously observed ratings in the system, instead of the personal aspects employed for users.
Then, the resulting representations for a user u and an item i are aggregated as follows:

Vu = s(Wu) = Σ_{j=1..k} wu(j) · r(j),   Vi = s(Wi) = Σ_{j=1..k} wi(j) · r(j)    (6)

where r(j) denotes the rating value whose probability is wu(j) (respectively wi(j)).
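As a plain-Python sketch of the chain f → g → s for a single user (our own helper names and a hypothetical five-star rating domain; the paper implements these steps in Scala on Spark):

```python
from collections import Counter

RATING_DOMAIN = [1.0, 2.0, 3.0, 4.0, 5.0]  # assumption: D discretized as five stars

def f(ratings):
    """Eq. (1): probability of each rating value in the observed ratings."""
    counts = Counter(ratings)
    return {r: counts[r] / len(ratings) for r in RATING_DOMAIN}

def g(P, k):
    """Eqs. (2)-(3): keep the k rating values with the highest probabilities."""
    return sorted(P.items(), key=lambda kv: kv[1], reverse=True)[:k]

def s(W):
    """Eq. (6): aggregate the retained (rating value, probability) pairs."""
    return sum(w * r for r, w in W)

Ru = [4.0, 4.0, 5.0, 3.0]       # ratings expressed by one user u
Vu = s(g(f(Ru), k=2))
print(Vu)  # → 2.75
```

With k = 2, the two most probable ratings (4.0 with probability 0.5 and 3.0 with probability 0.25 here) are retained and aggregated into Vu.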
Because each user u rates only a few items, an appropriate mechanism is necessary for handling data sparsity. The model enables connections only between the unit of the user u (i.e., Vu ∈ R) and the units of the items rated by u (i.e., Vi ∈ R where ru,i ≠ 0). The weight of the user u is defined by the following function:

αu = t(Vu, Vi1, Vi2, ..., Vi|I|) = ( Σ_{ru,i ∈ Ru} (Vu + Vi) ) / ( Σ_{ru,i ∈ Ru} ru,i )    (7)

where αu is a real value, which adjusts the interaction between the enabled units (i.e., the items and the user u).
In order to improve the direct estimation computed by the function t, and to avoid a grid search over the whole hyperparameter space, it is essential to take into account the standard deviation of the estimated parameters with respect to each user u. It is calculated as follows:

∆ = sd(αu1, αu2, ..., αu|U|) = sqrt( Σ_{u∈U} (αu − ᾱ)² / M ),  where ᾱ = ( Σ_{u∈U} αu ) / M    (8)
262
group. As result, the optimal value Γu ∈ R of the user u is determined as
264
follows:
AN US
263
X (Vu + Vi ) Γu = y(x) = argmin ru,i − , x ∈ {αu − ∆/2, αu , αu + ∆/2} x x ru,i ∈Ru
(9)
If the parameter Γu is 2, this means that it has no effect on predictions for
266
u. Otherwise, the parameter Γu controls the interactions between the units,
267
according to the personalized behavior of the user u.
ED
M
265
268
Then, to consider how the units affect the predictions after controlling the interactions through Γu, the parameter θu* is defined as follows:

θu* = q(θu) = argmin_{θu} Σ_{ru,i ∈ Ru} | ru,i − 2((1 − θu)Vu + θuVi)/Γu |,   θu ∈ {a, ..., d}    (10)

where a and d are the minimum and maximum possible values of θu, respectively. The role of the parameter θu* ∈ R is to emphasize the predicted opinions with respect to the personalized behavior of u, thus producing more appropriate predictions.
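Under the same assumptions as the previous sketches (plain Python, our own names and toy data), the direct estimation αu, the shared weight ∆, and the searches for Γu and θu* of Eqs. (7)-(10) could look like:

```python
import statistics

def alpha(Vu, item_vals, ratings):
    # Eq. (7): direct estimate αu = Σ(Vu + Vi) / Σ ru,i over the items rated by u.
    return sum(Vu + Vi for Vi in item_vals) / sum(ratings)

def gamma(Vu, item_vals, ratings, alpha_u, delta):
    # Eq. (9): try x ∈ {αu − ∆/2, αu, αu + ∆/2}, keep the best on the training data.
    err = lambda x: sum(abs(r - (Vu + Vi) / x) for Vi, r in zip(item_vals, ratings))
    return min([alpha_u - delta / 2, alpha_u, alpha_u + delta / 2], key=err)

def theta(Vu, item_vals, ratings, gamma_u, grid):
    # Eq. (10): grid search for θu* weighting the user unit against the item units.
    err = lambda th: sum(abs(r - 2 * ((1 - th) * Vu + th * Vi) / gamma_u)
                         for Vi, r in zip(item_vals, ratings))
    return min(grid, key=err)

Vu, item_vals, ratings = 3.5, [4.0, 3.0], [4.0, 3.0]    # one user's training data
a_u = alpha(Vu, item_vals, ratings)
delta = statistics.pstdev([a_u, 1.8, 2.2])              # Eq. (8) over hypothetical users
g_u = gamma(Vu, item_vals, ratings, a_u, delta)
t_u = theta(Vu, item_vals, ratings, g_u, [i / 10 for i in range(1, 10)])
print(a_u)  # → 2.0
```

In the paper these per-user searches run inside Spark transformations (Algorithm 1); the loop above only shows the arithmetic for a single user.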
4.2.1. Algorithm

The training step in a distributed fashion is described as follows:
Algorithm 1: Training step
Input: k: the dimension of the behavior; TR: path of the training data in HDFS
Output: RDDout: RDD of the estimated parameters

1: RDDtrain ← sc.textFile(TR);
2: RDDusers ← RDDtrain.map((user,item,rating) => (user,rating)).combineByKey;
3: RDDitems ← RDDtrain.map((user,item,rating) => (item,rating)).combineByKey;
4: RDDusers_f ← RDDusers.mapValues(apply the function f);
5: RDDitems_f ← RDDitems.mapValues(apply the function f);
6: RDDusers_g ← RDDusers_f.mapValues(apply the function g);
7: RDDitems_h ← RDDitems_f.mapValues(apply the function h);
8: RDDusers_s ← RDDusers_g.mapValues(apply the function s);
9: RDDitems_s ← RDDitems_h.mapValues(apply the function s);
10: RDD_t ← RDDJOIN.combineByKey.mapValues(apply the function t);
11: RDD_sd ← (RDD_t.map(X => αuser).stdev)/2.0;
12: RDD_y ← RDD_t.mapValues(apply the function y);
13: RDD_q ← RDD_y.mapValues(apply the function q);
14: RDDout ← RDD_q.map(X => (user, (Γuser, θuser)));
15: return RDDout;
4.3. Prediction step

To predict a rating ru,i for an item i that has not yet been evaluated by a user u, Eq. (11) is employed, according to the personalized behavior of the user u:

r̂u,i = 2((1 − θu*)Vu + θu*Vi) / Γu    (11)
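Continuing the sketch (our names, not the paper's code), the prediction of Eq. (11) is then a one-liner:

```python
def predict(Vu, Vi, gamma_u, theta_u):
    """Eq. (11): personalized blend of the user unit and the item unit."""
    return 2 * ((1 - theta_u) * Vu + theta_u * Vi) / gamma_u

# With Γu = 2 and θu* = 0.5, the prediction reduces to the plain average of Vu and Vi.
print(predict(3.0, 4.0, gamma_u=2.0, theta_u=0.5))  # → 3.5
```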
5. Experimentations

This section describes the experimentations performed. Section 5.1 presents the data sets used in this work. Section 5.2 presents the methodology. Section 5.3 describes the performance measures. Section 5.4 summarises the parameters of the algorithms. Section 5.5 presents the hardware and software used. Finally, Section 5.6 details the experimental results.
In this work, the experimentations are conducted on two benchmark data
286
sets: Movielens 10M and Movielens 20M [40].
AN US
287
CR IP T
280
288
MovieLens 10M: it is a movie rating dataset, which contains 10,000,054
289
ratings of 10,681 movies made by 71,567 users of the online movie recommender
290
service. All users had rated at least 20 movies.
MovieLens 20M: this data set contains 20,000,263 ratings from 138,493
291
users on 27,278 movies. All users had rated at least 20 movies.
5.2. Methodology

For the experimental study, the data sets are partitioned according to the following methodologies:

• Methodology 1: the data set is divided into two parts: a training set for training the model and a test set for evaluating the model's performance. The training set contains 80% of the ratings, while the test set includes the remaining 20%. This procedure is repeated 20 times, and the average is calculated.

• Methodology 2: the data set is divided randomly into two sets, with 80% of the ratings in the training set and 20% in the test set.
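A minimal sketch of the 80/20 random split in plain Python (the helper name and seed are ours):

```python
import random

def split_ratings(triples, train_frac=0.8, seed=0):
    """Randomly split (user, item, rating) triples into train and test sets."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

data = [("u%d" % i, "i%d" % i, 3.0) for i in range(10)]
train, test = split_ratings(data)
print(len(train), len(test))  # → 8 2
```

Repeating the call with 20 different seeds and averaging the resulting error values corresponds to methodology 1.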
5.3. Performance measures

Evaluation metric. The performance is evaluated by the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE), which are widely used for assessing the performance of recommender systems [41, 42, 43, 3, 22, 21, 44, 45, 46]. MAE and RMSE are defined as follows:

MAE = ( Σ_{(u,i)∈T} |ru,i − r̂u,i| ) / |T|    (12)

RMSE = sqrt( ( Σ_{(u,i)∈T} (ru,i − r̂u,i)² ) / |T| )    (13)

where T is the test set of size |T|, ru,i represents the known rating, and r̂u,i denotes the predicted value. In general, smaller values indicate better prediction quality.
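Both metrics are straightforward to compute; a plain-Python sketch (our helper names):

```python
from math import sqrt

def mae(pairs):
    """Eq. (12): mean absolute error over (true, predicted) rating pairs."""
    return sum(abs(r - p) for r, p in pairs) / len(pairs)

def rmse(pairs):
    """Eq. (13): root mean square error over (true, predicted) rating pairs."""
    return sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0)]
print(round(mae(pairs), 3), round(rmse(pairs), 3))  # → 0.667 0.707
```

Note that RMSE penalizes the single large error (1.0) more heavily than MAE does, which is why the two metrics are usually reported together.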
Runtime. The total time spent by the parallel algorithm, including distributing the data, the training and the prediction tasks.

Statistical tests. Student's t-test is a statistical test which is widely used to assess whether the differences between returned error values are statistically significant or due to chance. This test allows us to reject the null hypothesis, which states that there is no relationship between the two measured values [47]. In this work, according to methodology 1, each experimentation was performed twenty times, and the returned MAE and RMSE results are compared in order to establish whether the differences between the results are statistically significant.
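A minimal sketch of the paired t-statistic over per-run error values in pure Python (using scipy.stats.ttest_rel would be the usual shortcut; the error values below are hypothetical, not the paper's results):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(errors_a, errors_b):
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d = a - b."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical MAE values over repeated runs of two methods:
mae_ours = [0.664, 0.662, 0.665, 0.663]
mae_base = [0.676, 0.675, 0.677, 0.676]
print(paired_t(mae_ours, mae_base) < 0)  # negative t: ours has the lower error
```

The p-value is then obtained from the t-distribution with n − 1 degrees of freedom; a very small p-value, as in Table 5, rejects the null hypothesis.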
5.4. Parameter setting

An extensive number of experimentations were conducted in order to determine the best settings. The parameters with respect to each method are as follows:
Table 2: Parameter settings for the considered methods.

Method                 Parameter values
CESP                   the number of category experts nce = 100
UPCC                   the number of nearest neighbors nn = 300
FRAIPA                 λ = 0.025, αu ∈ [1.5, 1.5 + λ, 1.5 + 2λ, ..., 2]
Approximate approach   φ ∈ [0.4, 0.8], θu ∈ [0.1, 0.2, ..., 0.8, 0.9], k = 6
Exact approach         θu ∈ [0.1, 0.2, ..., 0.8, 0.9], k = 6
5.5. Hardware and software used

The experiments were executed on a cluster composed of two nodes: one master node and one slave node. Each node has the following features:
332
• Clock speed: 2.93 GHz
333
• RAM: 4 GB
334
In this work, all methods were written in Scala. The details of the operating
• Spark 2.1.0
337
• Hadoop-2.7.4
338
• Scala-2.11.8
339
• Ubuntu 16.04 LTS
340
• Mahout 0.13.0
341
5.6. Experimental results
This section presents the experimentations conducted on the two benchmark
342
data sets.
PT
343
ED
336
AN US
system and the frameworks used are as follows:
M
335
In fact, either the approximate or the exact approach represents an improved
345
version of FRAIPA approach adapted in Big Data environment. Therefore, to
CE
344
evaluate the performance, the results obtained by the proposed approaches are
347
compared with the following state-of-the-art methods:
AC
346
348
349
350
351
• CESP: Efficient recommendation methods using category experts for a large dataset [48]. • FRAIPA: A fast recommendation approach with improved prediction accuracy [13].
17
ACCEPTED MANUSCRIPT
• UPCC: User-based collaborative filtering method using pearson correla-
352
tion [9].
353
• ITAN: Item-based collaborative filtering method using tanimoto coeffi-
354
CR IP T
cient [30].
355
356
For the approximate method, the parameter φ is very important. It controls
357
how much data is enough to perform the training step. Figs. 4-5 show how the
358
parameter φ leverages the rating prediction accuracy on the Movielens 10M and
359
Movielens 20M datasets.
The approximate method shows high-quality prediction with the increase of
361
φ value. This is due to the fact that considering more training data leads to
362
improve the results.
AN US
360
On the other hand, to investigate how the parameter φ affects the compu-
364
tational time of the approximate method. Fig. 6 depicts the results, where φ
365
lies in the interval [0.4, 0.8].
M
363
Movielens 10M
0.680
MAE
ED
0.675 0.670 0.665
PT
0.660
AC
0.7
0.6
φ
0.5
0.4
0.5
0.4
Movielens 20M
0.670 0.665
MAE
CE
0.8
0.660 0.655 0.650 0.8
0.7
0.6
φ
Figure 4: The impact of φ on the MAE (methodology 1).
[Figure 5: The impact of φ on the RMSE (methodology 1).]
[Figure 6: The impact of φ on the runtime (methodology 1).]
Tables 3 and 4 report the results obtained by each algorithm, according to methodology 2 and methodology 1, respectively.
Table 3: Performance comparisons on Movielens 10M and Movielens 20M data sets (methodology 2).

                               Movielens 10M        Movielens 20M
Method                 φ       MAE      RMSE        MAE      RMSE
Approximate approach   0.4     0.6711   0.8858      0.6604   0.8796
                       0.5     0.6681   0.8818      0.6576   0.8755
                       0.6     0.6662   0.8790      0.6557   0.8728
                       0.7     0.6649   0.8769      0.6544   0.8708
                       0.8     0.6638   0.8754      0.6534   0.8694
Exact approach         -       0.6623   0.8735      0.6519   0.8674
CESP [48]              -       0.6870   0.8988      -        -
FRAIPA [13]            -       0.6758   0.8892      0.6643   0.8822
UPCC [9]               -       0.8218   1.0537      0.8261   1.0621
ITAN [30]              -       0.7308   0.9370      0.7131   0.9225
Table 4: Performance comparisons on Movielens 10M and Movielens 20M data sets (methodology 1).

                              Movielens 10M                                      Movielens 20M
Method           φ    MAE ± σ          RMSE ± σ         Elapsed time (s)   MAE ± σ          RMSE ± σ         Elapsed time (s)
Approximate      0.4  0.6715 ± 0.0002  0.8861 ± 0.0003   84.05             0.6605 ± 0.0002  0.8795 ± 0.0003  142.15
approach         0.5  0.6687 ± 0.0003  0.8822 ± 0.0003  106.35             0.6577 ± 0.0002  0.8756 ± 0.0004  186.00
                 0.6  0.6667 ± 0.0003  0.8794 ± 0.0004  130.65             0.6557 ± 0.0002  0.8729 ± 0.0003  237.75
                 0.7  0.6653 ± 0.0003  0.8774 ± 0.0004  167.20             0.6544 ± 0.0002  0.8710 ± 0.0004  314.00
                 0.8  0.6643 ± 0.0003  0.8761 ± 0.0003  190.25             0.6534 ± 0.0002  0.8696 ± 0.0003  423.00
Exact approach   -    0.6628 ± 0.0003  0.8740 ± 0.0003  239.20             0.6519 ± 0.0002  0.8676 ± 0.0003  489.20
FRAIPA [13]      -    0.6764 ± 0.0003  0.8899 ± 0.0004  206.90             0.6644 ± 0.0002  0.8823 ± 0.0003  464.55
As can be observed, for the approximate method, the performance in terms of MAE and RMSE improves as the value of φ grows, while the computational time increases at the same time.

On the other hand, to evaluate whether the differences between the MAE and RMSE values returned by two methods are statistically significant [47], Student's t-test is adopted. Table 5 provides the results of the t-test for both data sets.
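For reference, the t statistic underlying such a comparison can be sketched in a few lines of Python. The per-run error values below are hypothetical; a full test producing p-values would typically use, e.g., scipy.stats.ttest_ind.

```python
import math
from statistics import mean, stdev

def two_sample_t(x, y):
    """Student's t statistic for two independent samples of equal size,
    using a pooled variance estimate (equal-variance assumption)."""
    n = len(x)
    sp2 = (stdev(x) ** 2 + stdev(y) ** 2) / 2.0  # pooled variance, equal n
    return (mean(x) - mean(y)) / math.sqrt(sp2 * 2.0 / n)

# Hypothetical per-run MAE values for two methods over 20 runs each:
a = [0.6643 + 0.0001 * (i % 3) for i in range(20)]  # lower-error method
b = [0.6764 + 0.0001 * (i % 3) for i in range(20)]  # higher-error method
t = two_sample_t(a, b)  # strongly negative: method "a" has lower mean error
```

A large-magnitude negative t with a p-value below 0.05, as in Table 5, indicates that the first method's error is significantly lower than the second's.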
Table 5: The results of statistical tests (methodology 1).

                                    Movielens 10M                                         Movielens 20M
                                    MAE                      RMSE                         MAE                       RMSE
Methods                  φ    t           p-value       t           p-value       t            p-value       t            p-value
Approximate vs. FRAIPA   0.4  -48.03560   1.242392e-35  -31.76410   5.676034e-29  -45.37845    1.041547e-34  -22.73346    1.002181e-23
                         0.5  -73.09901   1.735525e-42  -63.78783   2.952940e-40  -75.75717    4.506952e-43  -53.63072    2.003388e-37
                         0.6  -91.35325   3.810803e-46  -80.70710   4.127006e-44  -102.16131   5.532934e-48  -77.28235    2.123250e-43
                         0.7  -108.03062  6.670309e-49  -96.47901   4.828962e-47  -114.44182   7.508449e-50  -89.84104    7.165563e-46
                         0.8  -116.31963  4.052301e-50  -107.58943  7.788731e-49  -129.54865   6.830205e-52  -104.77415   2.126280e-48
Exact vs. FRAIPA         -    -130.59199  5.039005e-52  -128.14769  1.031512e-51  -152.89147   1.274816e-54  -124.14316   3.437168e-51
Clearly, the results confirm that the improvements in terms of MAE and RMSE are statistically significant (p-value < 0.05).

6. Discussion

The contributions of this work pertain to handling large-scale data efficiently,
improving the quality of predictions, and reducing the computational time.

The obtained results show that the exact approach is able to produce significantly better prediction quality than the other state-of-the-art competitors. Meanwhile, thanks to random sampling, the approximate approach handles Big Data well while still achieving good results: indeed, as the value of φ grows, the prediction quality becomes more accurate.

Table 6 reports the minimum, average, and maximum error values for the exact and approximate approaches, with respect to each data set.
Table 6: Minimum, average, maximum MAE and RMSE values (methodology 1).

Movielens 10M
Method                φ    MAE (Min / Average / Max)    RMSE (Min / Average / Max)
Approximate approach  0.4  0.6711 / 0.6715 / 0.6720     0.8856 / 0.8861 / 0.8867
                      0.5  0.6681 / 0.6687 / 0.6693     0.8814 / 0.8822 / 0.8827
                      0.6  0.6661 / 0.6667 / 0.6672     0.8782 / 0.8794 / 0.8800
                      0.7  0.6649 / 0.6653 / 0.6658     0.8768 / 0.8774 / 0.8780
                      0.8  0.6638 / 0.6643 / 0.6648     0.8752 / 0.8761 / 0.8766
Exact approach        -    0.6623 / 0.6628 / 0.6633     0.8732 / 0.8740 / 0.8745

Movielens 20M
Method                φ    MAE (Min / Average / Max)    RMSE (Min / Average / Max)
Approximate approach  0.4  0.6602 / 0.6605 / 0.6612     0.8788 / 0.8795 / 0.8804
                      0.5  0.6573 / 0.6577 / 0.6583     0.8748 / 0.8756 / 0.8765
                      0.6  0.6554 / 0.6557 / 0.6563     0.8723 / 0.8729 / 0.8736
                      0.7  0.6540 / 0.6544 / 0.6550     0.8702 / 0.8710 / 0.8718
                      0.8  0.6531 / 0.6534 / 0.6540     0.8689 / 0.8696 / 0.8704
Exact approach        -    0.6516 / 0.6519 / 0.6524     0.8669 / 0.8676 / 0.8683
Furthermore, the t-test provides additional evidence to confirm that the RMSE and MAE improvements are statistically significant.
In terms of computational time, our proposal runs much faster than the other methods, which conforms to its O(MN) time complexity. In particular, the approximate approach is less time consuming than the exact approach. Relative to FRAIPA, the fastest competitor, it saves between 8.04% and 59.37% of the computational time for MovieLens 10M, and between 8.94% and 69.40% for MovieLens 20M. This makes it well suited for processing large-scale data sets.
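These savings percentages can be reproduced directly from the elapsed times reported in Table 4, taking FRAIPA's runtime as the baseline (the reported figures match this choice of baseline):

```python
# Reproducing the computational-time savings from the Table 4 elapsed times.
def savings(t_approx, t_baseline):
    """Percentage of the baseline's computational time saved."""
    return 100.0 * (t_baseline - t_approx) / t_baseline

# MovieLens 10M, FRAIPA baseline = 206.90 s:
print(round(savings(190.25, 206.9), 2))   # ~8.05  (phi = 0.8)
print(round(savings(84.05, 206.9), 2))    # ~59.38 (phi = 0.4)

# MovieLens 20M, FRAIPA baseline = 464.55 s:
print(round(savings(423.00, 464.55), 2))  # ~8.94  (phi = 0.8)
print(round(savings(142.15, 464.55), 2))  # ~69.40 (phi = 0.4)
```

The small discrepancies at the second decimal place (8.04 vs. 8.05, 59.37 vs. 59.38) come from truncation versus rounding.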
On the other hand, it is known that the personalized behaviors of users are independent, and the parameters of the exact and the approximate approaches are estimated with respect to each user. Thus, they provide the flexibility to update the personalized behaviors and the parameters efficiently, by considering just the new preferences in the system, without offline computation.
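As a minimal sketch of this idea (not the paper's actual update rule), a per-user statistic such as the mean rating can be refreshed from each new preference in O(1), with no re-run of the offline training step:

```python
# Illustrative: per-user parameters can absorb a new rating incrementally.
class UserProfile:
    def __init__(self):
        self.n = 0        # number of ratings seen for this user
        self.mean = 0.0   # running mean of the user's ratings

    def add_rating(self, r):
        """Incremental mean update: mu_new = mu + (r - mu) / (n + 1)."""
        self.n += 1
        self.mean += (r - self.mean) / self.n

profile = UserProfile()
for r in [4.0, 3.0, 5.0]:
    profile.add_rating(r)
print(profile.mean)  # 4.0
```

Because each user's parameters are independent, such updates touch only the affected profile, which is what makes online maintenance cheap.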
As a result, it is reasonable to conclude that the proposed approaches contribute to a more efficient design of recommender systems in the context of Big Data.
7. Conclusions

Recommender systems represent a useful tool for many online activities. However, with the rapid growth of information and the large numbers of users and items, recommendation in the context of Big Data requires new solutions to cope with the limitations associated with traditional recommender systems.

In this paper, we have developed a novel solution for recommendation in the context of Big Data. It is designed based on Apache Spark for processing large-scale data efficiently.
The basic idea behind this proposal is to employ the concept of a personalized behavior for each user, the representation of each item by approximating the opinions of users, and parallel and distributed training, in order to address several challenges related to Big Data and recommender systems, improving the quality of predictions while reducing the computational time.
Furthermore, the proposed solution uses only the user-item rating data to make predictions; thus, it can be easily adopted as a recommender system by real-world e-commerce companies.
The experiments carried out have demonstrated that the proposed solution outperforms state-of-the-art algorithms in both efficiency and effectiveness. In addition, it can be successfully applied to large datasets.
As part of future work, we plan to study the integration of various sources of data to improve the performance of the proposed solution.
Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References

[1] G. Bello-Orgaz, J. J. Jung, D. Camacho, Social big data: Recent achievements and new challenges, Information Fusion 28 (2016) 45–59.
[2] H. Wang, Z. Xu, H. Fujita, S. Liu, Towards felicitous decision making: An overview on challenges and trends of big data, Information Sciences 367 (2016) 747–765.
[3] R. Hu, W. Dou, J. Liu, ClubCF: A clustering-based collaborative filtering approach for big data application, IEEE Transactions on Emerging Topics in Computing 2 (2014) 302–313.
[4] C. P. Chen, C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences 275 (2014) 314–347.
[5] C. V. Sundermann, M. A. Domingues, M. da Silva Conrado, S. O. Rezende, Privileged contextual information for context-aware recommender systems, Expert Systems with Applications 57 (2016) 139–158.
[6] Z. Yang, L. Xu, Z. Cai, Z. Xu, Re-scale AdaBoost for attack detection in collaborative filtering recommender systems, Knowledge-Based Systems 100 (2016) 74–88.
[7] J. Bobadilla, F. Serradilla, J. Bernal, A new collaborative filtering metric that improves the behavior of recommender systems, Knowledge-Based Systems 23 (2010) 520–528.
[8] X. Zhou, J. He, G. Huang, Y. Zhang, SVD-based incremental approaches for recommender systems, Journal of Computer and System Sciences 81 (2015) 717–733.
[9] F. Zhang, T. Gong, V. E. Lee, G. Zhao, C. Rong, G. Qu, Fast algorithms to evaluate collaborative filtering recommender systems, Knowledge-Based Systems 96 (2016) 96–103.
[10] C. C. Aggarwal, Recommender Systems, Springer, 2016.
[11] A. Salah, N. Rogovschi, M. Nadif, A dynamic collaborative filtering system via a weighted clustering approach, Neurocomputing 175 (2016) 206–215.
[12] F. Petroni, L. Querzoni, R. Beraldi, M. Paolucci, LCBM: a fast and lightweight collaborative filtering algorithm for binary ratings, Journal of Systems and Software 117 (2016) 583–594.
[13] B. A. Hammou, A. A. Lahcen, FRAIPA: A fast recommendation approach with improved prediction accuracy, Expert Systems with Applications 87 (2017) 90–97.
[14] M. N. Moreno, S. Segrera, V. F. López, M. D. Muñoz, Á. L. Sánchez, Web mining based framework for solving usual problems in recommender systems. A case study for movies recommendation, Neurocomputing 176 (2016) 72–80.
[15] J. Lu, D. Wu, M. Mao, W. Wang, G. Zhang, Recommender system application developments: a survey, Decision Support Systems 74 (2015) 12–32.
[16] H. Liu, Z. Hu, A. Mian, H. Tian, X. Zhu, A new user similarity model to improve the accuracy of collaborative filtering, Knowledge-Based Systems 56 (2014) 156–166.
[17] X. Zhou, J. He, G. Huang, Y. Zhang, A personalized recommendation algorithm based on approximating the singular value decomposition (ApproSVD), in: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 02, IEEE Computer Society, 2012, pp. 458–464.
[18] M. Ghavipour, M. R. Meybodi, An adaptive fuzzy recommender system based on learning automata, Electronic Commerce Research and Applications 20 (2016) 105–115.
[19] C. Luo, B. Zhang, Y. Xiang, M. Qi, Gaussian-Gamma collaborative filtering: A hierarchical Bayesian model for recommender systems, Journal of Computer and System Sciences (2017).
[20] M. D. Ekstrand, J. T. Riedl, J. A. Konstan, et al., Collaborative filtering recommender systems, Foundations and Trends in Human–Computer Interaction 4 (2011) 81–173.
[21] Y. Ar, E. Bostanci, A genetic algorithm solution to the collaborative filtering problem, Expert Systems with Applications 61 (2016) 122–128.
[22] J. Chen, H. Wang, Z. Yan, et al., Evolutionary heterogeneous clustering for rating prediction based on user collaborative filtering, Swarm and Evolutionary Computation (2017).
[23] X. Liu, K. Aberer, SoCo: a social network aided context-aware recommender system, in: Proceedings of the 22nd International Conference on World Wide Web, ACM, 2013, pp. 781–802.
[24] A. Q. Macedo, L. B. Marinho, R. L. Santos, Context-aware event recommendation in event-based social networks, in: Proceedings of the 9th ACM Conference on Recommender Systems, ACM, 2015, pp. 123–130.
[25] S. Zhang, Q. Lv, Hybrid EGU-based group event participation prediction in event-based social networks, Knowledge-Based Systems (2017).
[26] Y.-w. Zhang, Y.-y. Zhou, F.-t. Wang, Z. Sun, Q. He, Service recommendation based on quotient space granularity analysis and covering algorithm on Spark, Knowledge-Based Systems (2018).
[27] C.-H. Lee, C.-Y. Lin, Implementation of lambda architecture: A restaurant recommender system over Apache Mesos, in: Advanced Information Networking and Applications (AINA), 2017 IEEE 31st International Conference on, IEEE, 2017, pp. 979–985.
[28] J. Chen, K. Li, H. Rong, K. Bilal, N. Yang, K. Li, A disease diagnosis and treatment recommendation system based on big data mining and cloud computing, Information Sciences (2018).
[29] S. Saravanan, K. Karthick, A. Balaji, A. Sajith, Performance comparison of Apache Spark and Hadoop based large scale content based recommender system, in: The International Symposium on Intelligent Systems Technologies and Applications, Springer, 2017, pp. 66–73.
[30] B. Kupisz, O. Unold, Collaborative filtering recommendation algorithm based on Hadoop and Spark, in: Industrial Technology (ICIT), 2015 IEEE International Conference on, IEEE, 2015, pp. 1510–1514.
[31] M.-Y. Hsieh, T.-H. Weng, K.-C. Li, A keyword-aware recommender system using implicit feedback on Hadoop, Journal of Parallel and Distributed Computing (2018).
[32] S. Panigrahi, R. K. Lenka, A. Stitipragyan, A hybrid distributed collaborative filtering recommender engine using Apache Spark, Procedia Computer Science 83 (2016) 1000–1006.
[33] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (2008) 107–113.
[34] D. Jiang, A. K. Tung, G. Chen, Map-Join-Reduce: Toward scalable and efficient data analysis on large clusters, IEEE Transactions on Knowledge and Data Engineering 23 (2011) 1299–1311.
[35] J. Maillo, S. Ramírez, I. Triguero, F. Herrera, kNN-IS: An iterative Spark-based design of the k-nearest neighbors classifier for big data, Knowledge-Based Systems 117 (2017) 3–15.
[36] F. Pulgar-Rubio, A. Rivera-Rivas, M. D. Pérez-Godoy, P. González, C. J. Carmona, M. del Jesus, MEFASD-BD: Multi-objective evolutionary fuzzy algorithm for subgroup discovery in big data environments - a MapReduce solution, Knowledge-Based Systems 117 (2017) 70–78.
[37] Apache Spark, 2017. URL: https://spark.apache.org, accessed: 06 Dec 2017.
[38] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, 2012, pp. 2–2.
[39] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica, Spark: Cluster computing with working sets, HotCloud 10 (2010) 95.
[40] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (2016) 19.
[41] G. Zhao, X. Qian, X. Xie, User-service rating prediction by exploring social users' rating behaviors, IEEE Transactions on Multimedia 18 (2016) 496–506.
[42] X. Ma, H. Lu, Z. Gan, Q. Zhao, An exploration of improving prediction accuracy by constructing a multi-type clustering based recommendation framework, Neurocomputing 191 (2016) 388–397.
[43] D. Wu, J. Lu, G. Zhang, A fuzzy tree matching-based personalized e-learning recommender system, IEEE Transactions on Fuzzy Systems 23 (2015) 2412–2426.
[44] M. Al-Hassan, H. Lu, J. Lu, A semantic enhanced hybrid recommendation approach: A case study of e-government tourism service recommendation system, Decision Support Systems 72 (2015) 97–109.
[45] M. G. Vozalis, K. G. Margaritis, Using SVD and demographic data for the enhancement of generalized collaborative filtering, Information Sciences 177 (2007) 3017–3037.
[46] M.-L. Wu, C.-H. Chang, R.-Z. Liu, Integrating content-based filtering with collaborative filtering using co-clustering with augmented matrices, Expert Systems with Applications 41 (2014) 2754–2761.
[47] L. Boratto, S. Carta, G. Fenu, Discovery and representation of the preferences of automatically detected groups: Exploiting the link between group modeling and clustering, Future Generation Computer Systems 64 (2016) 165–174.
[48] W.-S. Hwang, H.-J. Lee, S.-W. Kim, Y. Won, M.-s. Lee, Efficient recommendation methods using category experts for a large dataset, Information Fusion 28 (2016) 75–82.