Accepted Manuscript

Efficient Global Correlation Measures for a Collaborative Filtering Dataset
Adrian Satja Kurdija, Marin Silic, Klemo Vladimir, Goran Delac

PII: S0950-7051(18)30065-0
DOI: 10.1016/j.knosys.2018.02.013
Reference: KNOSYS 4221

To appear in: Knowledge-Based Systems

Received date: 21 July 2017
Revised date: 25 January 2018
Accepted date: 6 February 2018

Please cite this article as: Adrian Satja Kurdija, Marin Silic, Klemo Vladimir, Goran Delac, Efficient Global Correlation Measures for a Collaborative Filtering Dataset, Knowledge-Based Systems (2018), doi: 10.1016/j.knosys.2018.02.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Efficient Global Correlation Measures for a Collaborative Filtering Dataset

Adrian Satja Kurdija, Marin Silic, Klemo Vladimir, and Goran Delac
University of Zagreb, Faculty of Electrical Engineering and Computing, Consumer Computing Lab, Unska 3, 10000 Zagreb, Croatia
Abstract
Recommender systems based on collaborative filtering (CF) rely on datasets containing users' taste preferences for various items. The accuracy of various prediction approaches depends on the amount of similarity between users and items in a dataset. As a heuristic estimate of this data quality aspect, which can serve as an indicator of prediction ability, we define the Global User Correlation Measure (GUCM) and the Global Item Correlation Measure (GICM) of a dataset containing known user-item ratings. The proposed measures range from 0 to 1 and describe the quality of the dataset regarding the user-user and item-item similarities: a higher measure indicates more similar pairs and better prediction ability. The experiments show a correlation between the proposed measures and the accuracy of standard prediction models. The measures can be used to quickly estimate whether a dataset is suitable for collaborative filtering and whether we can expect high prediction accuracy of user-based or item-based CF approaches.

Keywords: collaborative filtering, dataset quality, global correlation, user similarity, item similarity

Preprint submitted to Knowledge-Based Systems, February 13, 2018
1. Introduction

Recommender systems are based on predictions, which rely on datasets containing known values. In collaborative filtering (CF), datasets contain known ratings corresponding to user-item pairs. CF models are evaluated on various datasets, and the results of a single model across different datasets are highly variable, since datasets typically exhibit different properties and yield different prediction accuracies. This introduces the need for a measure of dataset quality that would indicate its prediction difficulty. We can use such a measure as a reference point when proposing and evaluating a prediction model on a certain dataset. Developing a prediction model is expensive and time consuming, and quick heuristic indicators of potential prediction accuracy on a dataset can be very valuable, for example, when deciding which additional data to take into account. For instance, an e-commerce provider can use this approach when designing the CF process. Specifically, the item vector can be structured in a way that takes into account newly collected data, e.g., in case a new type of rating is introduced. The quality measure can be used to evaluate how the new dataset would behave if collaborative filtering were extended to these new attributes. In effect, the measure would estimate the level of maturity that the available collected data would give to the newly designed CF process. In other words, knowing the proposed measure of a dataset could help us predict whether high prediction accuracy can be expected. Other potential uses can be found for educational and scientific purposes.
It is often the case when examining a new approach that the required specific dataset is not available. In such cases, researchers often rely on synthetically generated data. The proposed measure could be used to assess whether the generated dataset resembles a realistic CF dataset (is it suitable for making predictions) or whether its values behave as if they were random, and if so, to which degree. Consequently, the measure can be used to further fine-tune the artificial dataset generators.
To make predictions, all CF models explicitly or implicitly rely on similarity between users and/or similarity between items. Prediction accuracy, and therefore the reliability of a recommender system, depends on the degree of correlation of the dataset's values, as opposed to randomness and noise. To the best of our knowledge, there are not many global correlation-based measures of dataset quality, and the existing ones include the calculation of many user-user or item-item similarities, which is no more efficient than calculating all predictions. Our work presents the following contributions:

• We define two novel heuristic measures of CF dataset quality, designed to quickly estimate the amount of global user correlation and item correlation.

• Using comprehensive experiments on synthetic and real-world data, we show that the proposed measures satisfy several desirable properties, such as correlating with the accuracy of standard prediction models.
• We derive several applicable principles from the results.

The rest of the paper is structured as follows. Related work is discussed in Sect. 2. Sect. 3 gives the formal preliminaries and defines the desired properties of the proposed measures. Sect. 4 gives precise definitions and the algorithm for calculating the measures. Sect. 5 and Sect. 6 describe the experimental results on artificial and real-world datasets. Conclusions are given in Sect. 7.

2. Related work
M
Data quality has multiple aspects, discussed in e.g. [1] and [2]. Natural noise, based on user inconsistency, was discussed in [3], [4], and showed
ED
to constrain the prediction ability beyond some maximal (”magic barrier”) value [5]. Another aspect is missing data, which is related to dataset spar-
PT
sity, specifically discussed in [6]. However, sparsity is information about the amount of available data, not about its properties; for example, it does
CE
not capture whether there are many (or any) users with similar preferences. Marlin et al. [7] have analyzed which data is missing, showing and utilizing
AC
non-randomness of the missing data distribution. User-user and item-item correlations are usually calculated using Pear-
son Correlation (PCC) and less often using vector cosine, Spearman rank or Kendall’s τ correlation [8]. However, these measures are defined on the 4
ACCEPTED MANUSCRIPT
level on individual users/items only. Computing all pairwise similarities to count the number of similar pairs (e.g., with P CC > 0.5) is infeasible for
CR IP T
large datasets. Recently, [9] measured sparsity and redundancy for setting the sampling levels in a CF model. Redundancy was defined using the number of users, items, and ratings, as well as the average pairwise similarity of
users having at least one jointly rated item. As it is the case for the number of similar pairs, this measure is computationally demanding as it poten-
AN US
tially calculates O(N 2 ) user-user similarities, each of which takes at least O(ratings per user) time, which gives a very impractical time complexity for datasets with e.g. 105 users.
Our approach attempts to obtain a related global measure more efficiently. It is based on user-user and item-item similarities, assuming that in a "high-quality" dataset there are groups of highly correlated users (we can predict the behavior of one based on the others) and groups of highly correlated items (we can predict the ratings of one based on the others). As these are, in general, two distinct properties, we actually propose two measures: one for the user correlation degree and one for the item correlation degree. As a consequence, we can compare the two measures when deciding between the user-based and the item-based prediction model, as their prediction ability can differ. This is useful only if the measures can be computed quickly, which is the case for the proposed measures.
3. Required properties of the measure

Let r_{u,i} denote the rating of user u on item i. Let \bar{r}_u denote the average rating of user u and let \bar{r}_i denote the average rating of item i. The similarity of users u, v is usually calculated as the Pearson Correlation Coefficient (PCC):

sim_{u,v} = \frac{\sum_{i \in I_{u,v}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{u,v}} (r_{u,i} - \bar{r}_u)^2} \sqrt{\sum_{i \in I_{u,v}} (r_{v,i} - \bar{r}_v)^2}},    (1)

where I_{u,v} is the set of items rated by both users u and v. Analogously, the item similarity is calculated as

sim_{i,j} = \frac{\sum_{u \in U_{i,j}} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{i,j}} (r_{u,i} - \bar{r}_i)^2} \sqrt{\sum_{u \in U_{i,j}} (r_{u,j} - \bar{r}_j)^2}},    (2)

where U_{i,j} is the set of users who have rated both items i and j.
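The user-user similarity of Eq. (1) can be sketched in a few lines of Python; the dict-based ratings layout and the function name are illustrative assumptions, not taken from the paper's implementation:

```python
from math import sqrt

def pcc_user_similarity(ratings_u, ratings_v):
    """Pearson correlation of Eq. (1); each argument maps item ids to
    ratings (hypothetical data layout)."""
    common = set(ratings_u) & set(ratings_v)  # I_{u,v}: co-rated items
    if len(common) < 2:
        return 0.0  # too few co-rated items for a meaningful correlation
    # Averages are taken over each user's full rating profile,
    # matching the definition of \bar{r}_u in the text.
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # constant deviations: correlation undefined
    return num / (den_u * den_v)
```

The item similarity of Eq. (2) is obtained by applying the same function to the columns of the user-item matrix instead of its rows.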
The standard CF approaches based on PCC are the user-based (UPCC) and item-based (IPCC) predictions [10]. To predict the rating r_{u,i}, the user-based approach searches for positive similarities between the current user u and other users v who have rated item i. The prediction is then calculated using the weighted average of their ratings:

UPCC_{u,i} = \bar{r}_u + \frac{\sum_v sim_{u,v} (r_{v,i} - \bar{r}_v)}{\sum_v sim_{u,v}}.    (3)

The item-based prediction works analogously, taking into account the current user's ratings for items j with positive similarity to the current item i:

IPCC_{u,i} = \bar{r}_i + \frac{\sum_j sim_{i,j} (r_{u,j} - \bar{r}_j)}{\sum_j sim_{i,j}}.    (4)
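The user-based prediction of Eq. (3) can be sketched as follows; the function and parameter names are hypothetical, and the similarities are assumed to be precomputed (e.g., via Eq. (1)):

```python
def upcc_predict(target, others, item, sims):
    """User-based prediction of Eq. (3).

    target : dict item -> rating for the current user u
    others : dict user_id -> (dict item -> rating) for candidate neighbors v
    item   : the item i whose rating we predict
    sims   : dict user_id -> precomputed sim_{u,v}
    """
    mean_u = sum(target.values()) / len(target)
    num, den = 0.0, 0.0
    for v, ratings_v in others.items():
        s = sims.get(v, 0.0)
        # Only neighbors with positive similarity who rated item i contribute.
        if s > 0 and item in ratings_v:
            mean_v = sum(ratings_v.values()) / len(ratings_v)
            num += s * (ratings_v[item] - mean_v)
            den += s
    # With no usable neighbors, fall back to the user's own average.
    return mean_u if den == 0 else mean_u + num / den
```

The item-based prediction of Eq. (4) follows by the same pattern, with the roles of users and items exchanged.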
It is clear that high similarities among users and items imply better prediction ability, not only when using PCC-based CF approaches. We therefore attempt to estimate the dataset quality using real numbers GUCM (Global User Correlation Measure) and GICM (Global Item Correlation Measure) with the following properties:

• GUCM, GICM ∈ [0, 1], with a low value corresponding to low similarities and a high value corresponding to high similarities.

• GUCM and GICM should correlate with prediction accuracy, i.e., higher GUCM/GICM should imply higher Top-N precision rates.

• GUCM and GICM should be close to 0 on artificial random datasets (with random real-valued ratings).

• GUCM should be close to 1 on a dataset where all users are like-minded. Analogously for GICM and items.

• GUCM should negatively correlate with the number of natural clusters of similar users, i.e., with the variability of user types. Analogously for GICM and items.

The following section gives definitions of GUCM and GICM with the algorithm for their calculation. The properties described above are confirmed in the experimental results, Sect. 5 and 6.
ACCEPTED MANUSCRIPT
4. Defining and calculating the measures As the definitions and the calculation algorithms for GUCM and GICM
CR IP T
are completely analogous (symmetrical), we will describe GUCM in detail; GICM is then easily defined and calculated by switching the roles of ”users” and ”items” (for example, by transposing the user-item matrix).
The basic idea is that if there are groups of like-minded users, this will
AN US
reflect in their ratings of a single item: the ratings of this item will not be uniformly distributed, but will be divisible into groups of mutually close
ratings. We will therefore focus on the ratings distribution at item level and estimate its divergence from the uniform distribution.
Namely, looking at a single item i, we estimate the degree of users' agreement on the item rating as a real number in [0, 1] as follows. We first compute the number of close pairs, denoted by CP_i, which is the number of user pairs whose normalized ratings (with the average subtracted) differ by less than a certain threshold T:

CP_i = |\{\{u, v\} \subseteq U_i : |(r_{u,i} - \bar{r}_u) - (r_{v,i} - \bar{r}_v)| < T\}|,    (5)

where U_i is the set of all users who have rated item i. If their number is |U_i| = K_i, the percentage of close pairs, denoted by p_i, equals the number of close pairs divided by the number of all considered pairs:

p_i = \frac{CP_i}{K_i (K_i - 1)/2}.    (6)
This represents the probability that a random pair of users have close ratings on item i. Since some close ratings naturally appear in random datasets as well, and since the threshold value T has a direct impact on p_i, we will "correct" p_i by subtracting the expected value of the same probability for the random distribution, i.e., the probability that two uniformly distributed random ratings from the same range differ by less than T. This probability equals

q = \frac{T(2D - T)}{D^2},    (7)

where D is the given range, i.e., the difference between the maximum and the minimum value in the dataset: D = r_max − r_min. The proof of (7) is given in Appendix A.
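Eq. (7) is easy to sanity-check numerically: drawing many pairs of uniform values and counting how often they fall within T of each other should reproduce q up to Monte Carlo noise. The following sketch (the function name and trial count are illustrative choices) does exactly that:

```python
import random

def close_pair_probability(D, T, trials=200_000, seed=0):
    """Empirical probability that two independent uniform values on an
    interval of length D differ by less than T (a check of Eq. (7))."""
    rng = random.Random(seed)
    hits = sum(abs(rng.uniform(0, D) - rng.uniform(0, D)) < T
               for _ in range(trials))
    return hits / trials

D, T = 4.0, 1.0
q = T * (2 * D - T) / D ** 2   # Eq. (7): for D=4, T=1 this gives 7/16
# close_pair_probability(D, T) should agree with q up to sampling error.
```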
We therefore define the user agreement degree on item i, denoted by UAD_i, as p_i − q, i.e., the difference between the percentage of close pairs and the expected percentage for a random dataset. In the unlikely case that p_i − q < 0, we set the value to zero:

UAD_i = \max\{p_i - q, 0\} \in [0, 1].    (8)

Finally, GUCM is defined as the weighted average of user agreement degrees across items, with the weight w_i equal to K_i (the amount of data used in calculating UAD_i):

GUCM = \frac{\sum_{i=1}^{M} K_i \cdot UAD_i}{\sum_{i=1}^{M} K_i} \in [0, 1],    (9)
where M is the total number of items. We propose to set the aforementioned "closeness" threshold T_user to the standard deviation of the user averages \bar{r}_u, because a small standard deviation implies similar subjective scales and, therefore, smaller differences between close pairs. We have also tried other threshold values and found that the proposed choice gives meaningful results in practice.
4.1. The algorithm
In order to compute CP_i efficiently, we first sort the known ratings of item i into an increasing sequence a_1, ..., a_{K_i}. We want to find the number of pairs (k, l) such that k < l and a_l − a_k < T. For a fixed k, we find the largest possible l still within the threshold, i.e., the largest a_l less than a_k + T; the number of pairs for this fixed k then equals l − k, because all pairs (k, k+1), (k, k+2), ..., (k, l) are close. When moving to the next k (i.e., increasing k by one), we update l, which obviously increases or stays the same. This means that, after initial sorting in O(K_i log K_i), the time complexity of calculating CP_i is O(K_i), as we only move the indices k and l from the beginning to the end of the sorted ratings sequence.

The complete GUCM algorithm is given in Alg. 1; computing GICM is entirely symmetrical (users ←→ items). An implementation can be found at https://bitbucket.org/satja/gcm/src.

Since the calculations of UAD_i are mutually independent (as are the calculations of \bar{r}_u), they can be computed concurrently.
Algorithm 1: The algorithm for calculating GUCM
  Data: user-item ratings r_{u,i} (u ∈ {1, ..., N}, i ∈ {1, ..., M})
  D := r_max − r_min
  for u = 1 to N independently do
      \bar{r}_u := average{r_{u,i} : u has rated i}
  end
  T := standardDeviation{\bar{r}_u : u = 1, ..., N}
  q := T(2D − T)/D^2
  for i = 1 to M independently do
      K_i := |{u : u has rated i}|
      (a_1, a_2, ..., a_{K_i}) := sorted({r_{u,i} : u has rated i})
      CP_i := 0
      l := 1
      for k = 1 to K_i do
          while l < K_i and a_{l+1} − a_k < T do
              l := l + 1
          end
          CP_i := CP_i + l − k
      end
      p_i := CP_i / (K_i (K_i − 1)/2)
      UAD_i := max{p_i − q, 0}
  end
  Return GUCM := (Σ_{i=1}^{M} K_i · UAD_i) / (Σ_{i=1}^{M} K_i)
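A minimal Python sketch of Alg. 1 follows (the reference implementation is at the bitbucket link above). It assumes a dict-of-dicts ratings layout and uses the population standard deviation for T; normalized ratings are subtracted per user, skipping items with fewer than two ratings:

```python
from statistics import pstdev

def gucm(ratings):
    """Compute GUCM for `ratings`: a dict user -> {item: value}.
    GICM is obtained by transposing the input (users <-> items)."""
    values = [r for user in ratings.values() for r in user.values()]
    D = max(values) - min(values)
    means = {u: sum(ur.values()) / len(ur) for u, ur in ratings.items()}
    T = pstdev(means.values())                       # closeness threshold
    q = T * (2 * D - T) / D ** 2 if D > 0 else 0.0   # Eq. (7)

    # Group normalized ratings r_{u,i} - mean_u by item.
    by_item = {}
    for u, ur in ratings.items():
        for i, r in ur.items():
            by_item.setdefault(i, []).append(r - means[u])

    num = den = 0.0
    for a in by_item.values():
        K = len(a)
        if K < 2:
            continue  # no pairs to compare on this item
        a.sort()
        # Two-pointer count of pairs closer than T: O(K) after sorting.
        cp, l = 0, 0
        for k in range(K):
            l = max(l, k)
            while l + 1 < K and a[l + 1] - a[k] < T:
                l += 1
            cp += l - k
        p = cp / (K * (K - 1) / 2)   # Eq. (6)
        uad = max(p - q, 0.0)        # Eq. (8)
        num += K * uad
        den += K
    return num / den if den else 0.0  # Eq. (9)
```

Note that if all user averages coincide exactly, T degenerates to 0 and no pair counts as close; a production implementation would need to guard this edge case, which real-valued datasets are unlikely to hit.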
5. Results on synthetic data

5.1. Random categorical sets

In this experiment, we randomly generated several categorical user-item matrices of size 5000 × 5000 with rating values in {0, 1, ..., K − 1}, where K is the number of categories (possible rating values). The results are given in Fig. 1. We can see that a higher number of categories results in a lower measure, which is a desired property, because increasing the number of possible rating values increases the randomness ("entropy") in the dataset, making the users (or items) less likely to be similar. Also, as expected, the measure approaches 0 for highly random datasets.
[Figure 1: Random categorical sets, N = M = 5000. GUCM and GICM decrease as the number of categories grows from 2 to 32.]
5.2. Artificial sets with natural clusters

To show that the measure is roughly proportional to the amount of agreement in the dataset, we created several complete 5000 × 5000 user-item matrices and filled them with random integer values from [0, 10000] so that there are K clusters of equal users, where K was varied (K = 1, 2, ..., 10, 20, ..., 90, 100). The results are given in Fig. 2. As expected, GUCM ≈ 1 when all users agree, and GUCM decreases as the number of natural clusters (user variability) increases; moreover, GUCM ≈ 1/K. Since we only clustered the users, GICM is lower than GUCM, as expected. High correlation between users implies some correlation between items, which explains why GICM is also negatively correlated with the number of user clusters.
[Figure 2: Artificial sets with natural user clusters, N = M = 5000. GUCM ≈ 1/K as the number K of clusters of equal users grows from 1 to 100; GICM follows the same decreasing trend at lower values.]
6. Results on real-world data

For each of the 13 considered real-world datasets (see Table 1), we randomly divided the data into training and testing sets (ratio 90:10), and the GUCM/GICM measures for each dataset were computed using the training set only. We compared these measures with two other, computationally much more demanding measures. The first was the percentage of pairs with (PCC) similarity higher than 0.5, called similar user pairs and similar item pairs. The second was the average similarity of the pairs which have at least one common rating (i.e., ratings differing by at most 10% of the rating scale); this measure is an ingredient of the Redundancy measure of [9], and we call it the user similarity degree and item similarity degree. (The Redundancy measure itself, having other ingredients besides similarity and being made for a different purpose, did not correlate with the proposed measures.)

Dataset          | users  | items  | ratings  | range          | reference
Amazon QoS^a     | 50     | 49     | 2450     | [0.0, 1.0]     | [11]
FilmTrust^b      | 1508   | 2071   | 35497    | 0.5, 1, ..., 4 | [12]
Ciao^c           | 17615  | 16121  | 72665    | 1, 2, 3, 4     | [13]
Epinions^d       | 40163  | 139738 | 664824   | 1, 2, 3, 4     | [14]
Flixter^e        | 147612 | 48794  | 8196077  | 0.5, 1, ..., 5 |
MovieLens 100K^f | 943    | 1682   | 100000   | 1, 2, ..., 5   |
MovieLens 1M^g   | 6040   | 3706   | 1000209  | 1, 2, ..., 5   |
Jester^h         | 50691  | 150    | 1728830  | [-10.0, 10.0]  | [15]
Netflix^i        | 50000  | 17770  | 10375459 | 1, 2, ..., 5   |
Anime^j          | 69600  | 9927   | 6337241  | 1, 2, ..., 10  |
YahooMovies^k    | 7642   | 11916  | 211231   | 1, 2, ..., 13  |
BookCrossing^l   | 105281 | 340541 | 1149763  | 0, 1, ..., 10  | [16]
MovieTweetings^m | 51503  | 29564  | 656276   | 0, 1, ..., 10  | [17]

Table 1: Real-world datasets description
a http://ccl.fer.hr/people/research-assistants/marin-silic/clus-evaluation-dataset
b http://www.librec.net/datasets/filmtrust.zip
c http://www.librec.net/datasets/CiaoDVD.zip
d http://www.trustlet.org/downloaded_epinions.html
e http://www.cs.ubc.ca/~jamalim/datasets
f http://grouplens.org/datasets/movielens/100k
g http://grouplens.org/datasets/movielens/1m
h http://eigentaste.berkeley.edu/dataset
i http://www.netflixprize.com
j https://www.kaggle.com/CooperUnion/anime-recommendations-database
k https://webscope.sandbox.yahoo.com
l http://www2.informatik.uni-freiburg.de/~cziegler/BX/
m https://github.com/sidooms/MovieTweetings
We have also implemented standard user-based (UPCC) and item-based (IPCC) collaborative filtering approaches based on similar users/items as described in Sect. 3, using the training set data to predict the values in the testing set, as well as predictions based on the Matrix Factorization (MF) algorithm using stochastic gradient descent [18], where the hyperparameters were learned from the training set using cross-validation. Accuracy was defined using the relevance of the items with the highest (Top 10) ratings for a given user; namely, we used the Normalized Discounted Cumulative Gain (nDCG) measure [19, 20].

dataset        | GUCM  | GICM  | sim. user pairs | sim. item pairs | user sim. deg. | item sim. deg. | UPCC nDCG | IPCC nDCG | MF nDCG
Netflix        | 0.078 | 0.109 | 0.155 | 0.285 | 0.161 | 0.469 | 0.902 | 0.904 | 0.889
MovieLens1M    | 0.068 | 0.121 | 0.201 | 0.168 | 0.215 | 0.199 | 0.909 | 0.868 | 0.910
MovieLens100K  | 0.068 | 0.124 | 0.179 | 0.211 | 0.171 | 0.276 | 0.914 | 0.920 | 0.905
BookCrossing   | 0.109 | 0.115 | 0.001 | 0.002 | 0.513 | 0.328 | 0.882 | 0.950 | 0.878
Jester         | 0.162 | 0.141 | 0.146 | 0.031 | 0.144 | 0.347 | 0.955 | 0.941 | 0.948
FilmTrust      | 0.142 | 0.213 | 0.147 | 0.028 | 0.113 | 0.246 | 0.958 | 0.979 | 0.955
Epinions       | 0.178 | 0.212 | 0.012 | 0.001 | 0.647 | 0.296 | 0.968 | 0.980 | 0.969
YahooMovies    | 0.159 | 0.235 | 0.307 | 0.029 | 0.435 | 0.300 | 0.971 | 0.971 | 0.968
Flixter        | 0.223 | 0.255 | 0.045 | 0.282 | 0.161 | 0.614 | 0.943 | 0.971 | 0.940
CiaoDVD        | 0.267 | 0.287 | 0.001 | 0.005 | 0.268 | 0.226 | 0.988 | 0.983 | 0.987
Anime          | 0.250 | 0.342 | 0.206 | 0.208 | 0.266 | 0.342 | 0.961 | 0.947 | 0.957
MovieTweetings | 0.377 | 0.400 | 0.019 | 0.015 | 0.299 | 0.268 | 0.983 | 0.976 | 0.981
AmazonQoS      | 0.684 | 0.650 | 1.000 | 0.989 | 0.988 | 0.877 | 0.999 | 0.993 | 0.999

Table 2: Numerical results for the real-world datasets
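The Top-10 nDCG values above can be sketched as follows. The exact gain/discount variant used in the paper is not specified in this excerpt, so this sketch assumes the common formulation DCG = Σ rel_i / log2(i + 1) over the top-k predicted items, normalized by the ideal DCG; the function and parameter names are illustrative:

```python
from math import log2

def ndcg_at_k(predicted, actual, k=10):
    """Top-k nDCG: `predicted` maps item -> predicted rating,
    `actual` maps item -> true rating. Uses rel/log2(rank+2) gains
    with 0-based ranks (an assumed, common variant)."""
    # Rank items by predicted rating, keep the top k.
    ranked = sorted(predicted, key=predicted.get, reverse=True)[:k]
    dcg = sum(actual.get(item, 0.0) / log2(rank + 2)
              for rank, item in enumerate(ranked))
    # Ideal DCG: the same discount applied to the best possible ordering.
    ideal = sorted(actual.values(), reverse=True)[:k]
    idcg = sum(rel / log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A prediction model that ranks the test items in their true order scores 1.0; misordering high- and low-rated items lowers the score.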
perspective | proposed metric | other metric           | Pearson corr. coef. | p-value
user-based  | GUCM            | similar user pairs     | 0.705 | 0.007
item-based  | GICM            | similar item pairs     | 0.679 | 0.011
user-based  | GUCM            | user similarity degree | 0.689 | 0.009
item-based  | GICM            | item similarity degree | 0.633 | 0.020
user-based  | GUCM            | UPCC Top-10 nDCG       | 0.728 | 0.005
item-based  | GICM            | IPCC Top-10 nDCG       | 0.622 | 0.023
general     | (GUCM+GICM)/2   | MF Top-10 nDCG         | 0.751 | 0.003

Table 3: Correlations between GUCM/GICM and other prediction ability indicators across 13 real-world datasets
Numerical results are given in Table 2, and the correlations between the proposed measures and other prediction ability indicators (i.e., between columns of Table 2) are given in Table 3. For each row in Table 3, the Pearson correlation coefficient (in range [−1, 1]) was calculated for two 13-element vectors, each containing the values of a given metric for all datasets. The p-value is the probability that the obtained correlation coefficient is accidental. Relatively high coefficients (0.62–0.75) and low p-values confirm the relation between the proposed measures and other prediction ability indicators, showing that higher GUCM and GICM tend to imply more similarities and greater prediction ability.

[Figure 3: GUCM (black) and user-based Top-10 precision (red), per dataset.]

[Figure 4: GICM (black) and item-based Top-10 precision (red), per dataset.]

[Figure 5: (GUCM+GICM)/2 (black) and MF-based Top-10 precision (red), per dataset.]

perspective | metric                                | prediction accuracy | Pearson corr. coef. | p-value
user-based  | similar user pairs                    | UPCC Top-10 nDCG    | 0.356 | 0.233
item-based  | similar item pairs                    | IPCC Top-10 nDCG    | 0.088 | 0.775
general     | (sim. user pairs + sim. item pairs)/2 | MF Top-10 nDCG      | 0.285 | 0.345
user-based  | user similarity degree                | UPCC Top-10 nDCG    | 0.371 | 0.212
item-based  | item similarity degree                | IPCC Top-10 nDCG    | 0.299 | 0.320
general     | (user sim. deg. + item sim. deg.)/2   | MF Top-10 nDCG      | 0.360 | 0.227

Table 4: Correlations between existing (less efficient) measures and prediction accuracies across 13 real-world datasets
For the sake of comparison, Table 4 presents the results for the alternative measures: similar user/item pairs and user/item similarity degree. As we can see, the correlations between these measures and the prediction accuracies, although expectedly positive, are weaker and less significant than the same correlations for the GUCM/GICM measures. This means that GUCM and GICM are better estimators of prediction accuracy, apart from being computationally much more efficient.

Relations between the proposed measures and the precision of prediction algorithms are illustrated in Fig. 3 (GUCM vs. user-based predictions), Fig. 4 (GICM vs. item-based predictions), and Fig. 5 (average of GUCM and GICM vs. MF-based predictions). In each figure, black columns represent the proposed measures (scale on the left y-axis), while red columns represent the Top-10 nDCG precision of the prediction algorithms (scale on the right y-axis). The figures show that usually, when GUCM and GICM are higher, Top-10 predictions are better. There are anomalies, since the proposed measures are no more than heuristics, but the following observation can be empirically derived from these results: a proposed measure lower than 0.13 usually implies a corresponding Top-10 precision lower than 94% (supported in 11/12 cases), while a proposed measure higher than 0.14 usually implies a corresponding Top-10 precision of at least 94% (supported in 27/27 cases). This is illustrated in the figures: the vertical line separates GCM measures lower and higher than 0.135, while the horizontal line separates precisions lower and higher than 94%.

7. Conclusion and further study

The user-based and item-based Global Correlation Measures (GCM) have been empirically shown to satisfy several desirable requirements: they correlate with the amount of user-user and item-item similarities, as well as with the prediction accuracy; they approach a value of 1 on datasets with equal users and a value of 0 on datasets with random ratings; and they negatively correlate with the number of natural clusters of similar users/items. Since the proposed measures can be computed efficiently, in time proportional to the number of known ratings, they can be useful when deciding whether to apply CF on a given dataset – the above results suggest that a GCM value higher than 0.14 tends to imply a high prediction accuracy. In case one needs to choose whether to include some additional or questionable records in a dataset, GCM can be used to estimate whether the dataset quality would increase or decrease with the considered change. GCM can also be used to choose or differentiate between multiple datasets for various educational and research purposes, such as evaluating a new CF model. In further study, the proposed measures could be refined to take into account semantic domain knowledge [21], fuzziness [22], and multirelational networks [23].

Acknowledgment
The authors acknowledge the support of the Croatian Science Foundation through the Recommender System for Service-oriented Architecture (IP-2014-09-9606) research project.

References
[1] R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers, J. Manage. Inf. Syst. 12 (4) (1996) 5-33. doi:10.1080/07421222.1996.11518099.

[2] L. L. Pipino, Y. W. Lee, R. Y. Wang, Data quality assessment, Commun. ACM 45 (4) (2002) 211-218. doi:10.1145/505248.506010.

[3] X. Amatriain, J. M. Pujol, N. Tintarev, N. Oliver, Rate it again: Increasing recommendation accuracy by user re-rating, in: Proceedings of the Third ACM Conference on Recommender Systems, RecSys '09, ACM, New York, NY, USA, 2009, pp. 173-180. doi:10.1145/1639714.1639744.

[4] X. Amatriain, J. M. Pujol, N. Oliver, I Like It... I Like It Not: Evaluating User Ratings Noise in Recommender Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 247-258. doi:10.1007/978-3-642-02247-0_24.

[5] A. Said, B. J. Jain, S. Narr, T. Plumbaum, Users and Noise: The Magic Barrier of Recommender Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 237-248. doi:10.1007/978-3-642-31454-4_20.

[6] M. Grčar, D. Mladenič, B. Fortuna, M. Grobelnik, Data Sparsity Issues in the Collaborative Filtering Framework, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 58-76. doi:10.1007/11891321_4.

[7] B. M. Marlin, R. S. Zemel, S. Roweis, M. Slaney, Collaborative filtering and the missing at random assumption, in: Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI'07, AUAI Press, Arlington, Virginia, United States, 2007, pp. 267-275.

[8] X. Su, T. M. Khoshgoftaar, A survey of collaborative filtering techniques, Adv. in Artif. Intell. 2009 (2009). doi:10.1155/2009/421425.

[9] O. Sar Shalom, S. Berkovsky, R. Ronen, E. Ziklik, A. Amihood, Data quality matters in recommender systems, in: Proceedings of the 9th ACM Conference on Recommender Systems, RecSys '15, ACM, New York, NY, USA, 2015, pp. 257-260. doi:10.1145/2792838.2799670.

[10] J. S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI'98, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998, pp. 43-52.

[11] M. Silic, G. Delac, I. Krka, S. Srbljic, Scalable and accurate prediction of availability of atomic web services, IEEE Transactions on Services Computing 7 (2) (2014) 252-264. doi:10.1109/TSC.2013.3.

[12] G. Guo, J. Zhang, N. Yorke-Smith, A novel Bayesian similarity measure for recommender systems, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013, pp. 2619-2625.

[13] G. Guo, J. Zhang, D. Thalmann, N. Yorke-Smith, ETAF: An extended trust antecedents framework for trust prediction, in: Proceedings of the 2014 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2014, pp. 540-547.

[14] P. Massa, P. Avesani, Trust-aware recommender systems, in: Proceedings of the 2007 ACM Conference on Recommender Systems, 2007, pp. 17-24.

[15] K. Goldberg, T. Roeder, D. Gupta, C. Perkins, Eigentaste: A constant time collaborative filtering algorithm, Information Retrieval 4 (2) (2001) 133-151. doi:10.1023/A:1011419012209.

[16] C.-N. Ziegler, S. M. McNee, J. A. Konstan, G. Lausen, Improving recommendation lists through topic diversification, in: Proceedings of the 14th International Conference on World Wide Web, WWW '05, ACM, New York, NY, USA, 2005, pp. 22-32. doi:10.1145/1060745.1060754.

[17] S. Dooms, T. De Pessemier, L. Martens, MovieTweetings: a movie rating dataset collected from Twitter, in: Workshop on Crowdsourcing and Human Computation for Recommender Systems, held in conjunction with the 7th ACM Conference on Recommender Systems, 2013, p. 2.

[18] G. Takács, I. Pilászy, B. Németh, D. Tikk, Scalable collaborative filtering approaches for large recommender systems, J. Mach. Learn. Res. 10 (2009) 623-656.

[19] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst. 20 (4) (2002) 422-446. doi:10.1145/582415.582418.

[20] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in: Proceedings of the 22nd International Conference on Machine Learning, ICML '05, ACM, New York, NY, USA, 2005, pp. 89-96. doi:10.1145/1102351.1102363.

[21] Q. Shambour, J. Lu, A trust-semantic fusion-based recommendation approach for e-business applications, Decision Support Systems 54 (1) (2012) 768-780. doi:10.1016/j.dss.2012.09.005.

[22] Z. Zhang, H. Lin, K. Liu, D. Wu, G. Zhang, J. Lu, A hybrid fuzzy-based personalized recommender system for telecom products/services, Information Sciences 235 (2013) 117-129. doi:10.1016/j.ins.2013.01.025.

[23] M. Mao, J. Lu, G. Zhang, J. Zhang, Multirelational social recommendations via multigraph ranking, IEEE Transactions on Cybernetics 47 (12) (2017) 4049-4061. doi:10.1109/TCYB.2016.2595620.
Appendix A. Proof of Eq. (7)

Here we prove Eq. (7), which gives the probability that two random values from the uniform distribution on an interval of length D differ by less than T. If we rescale the interval to [0, 1], the threshold difference must also be rescaled by a factor of 1/D, i.e., we are looking for the probability that the difference between two random values in [0, 1] is less than T/D.

More precisely, let us assume that X and Y are independent uniformly distributed random variables on [0, 1]. We are looking for the probability density function of their difference X − Y ∈ [−1, 1], which is given by the convolution of the corresponding separate density functions:

f_{X-Y}(x) = (f_X * f_{-Y})(x) = \int f_X(y) f_{-Y}(x - y) \, dy
           = \int \mathbf{1}_{[0,1]}(y) \mathbf{1}_{[-1,0]}(x - y) \, dy
           = \int_0^1 \mathbf{1}_{[-1,0]}(x - y) \, dy
           = \int_{x-1}^{x} \mathbf{1}_{[-1,0]}(t) \, dt
           = \min\{x, 0\} - \max\{x - 1, -1\}
           = \min\{x, 0\} - (\max\{x, 0\} - 1)
           = 1 - |x|.

Therefore, Pr[−T/D ≤ X − Y ≤ T/D] equals

\int_{-T/D}^{T/D} (1 - |t|) \, dt = 2 \int_0^{T/D} (1 - t) \, dt = 2 \left[ t - \frac{t^2}{2} \right]_0^{T/D} = 2 \left( \frac{T}{D} - \frac{T^2}{2D^2} \right) = \frac{T(2D - T)}{D^2},

which proves Eq. (7).