Efficient global correlation measures for a collaborative filtering dataset

Efficient global correlation measures for a collaborative filtering dataset

Accepted Manuscript Efficient Global Correlation Measures for a Collaborative Filtering Dataset Adrian Satja Kurdija, Marin Silic, Klemo Vladimir, Go...

588KB Sizes 0 Downloads 42 Views

Accepted Manuscript

Efficient Global Correlation Measures for a Collaborative Filtering Dataset Adrian Satja Kurdija, Marin Silic, Klemo Vladimir, Goran Delac PII: DOI: Reference:

S0950-7051(18)30065-0 10.1016/j.knosys.2018.02.013 KNOSYS 4221

To appear in:

Knowledge-Based Systems

Received date: Revised date: Accepted date:

21 July 2017 25 January 2018 6 February 2018

Please cite this article as: Adrian Satja Kurdija, Marin Silic, Klemo Vladimir, Goran Delac, Efficient Global Correlation Measures for a Collaborative Filtering Dataset, Knowledge-Based Systems (2018), doi: 10.1016/j.knosys.2018.02.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Efficient Global Correlation Measures for a Collaborative Filtering Dataset

CR IP T

Adrian Satja Kurdija, Marin Silic, Klemo Vladimir, and Goran Delac University of Zagreb, Faculty of Electrical Engineering and Computing, Consumer Computing Lab, Unska 3, 10000 Zagreb, Croatia

AN US

Abstract

Recommender systems based on collaborative filtering (CF) rely on datasets containing users’ taste preferences for various items. Accuracy of various prediction approaches depends on the amount of similarity between users and

M

items in a dataset. As a heuristic estimate of this data quality aspect, which could serve as an indicator of the prediction ability, we define the Global

ED

User Correlation Measure (GUCM) and the Global Item Correlation Measure (GICM) of a dataset containing known user-item ratings. The proposed

PT

measures range from 0 to 1 and describe the quality of the dataset regarding the user-user and item-item similarities: a higher measure indicates more

CE

similar pairs and better prediction ability. The experiments show a correlation between the proposed measures and the accuracy of standard prediction

AC

models. The measures can be used to quickly estimate whether a dataset is suitable for collaborative filtering and whether we can expect high prediction accuracy of user-based or item-based CF approaches. Keywords: collaborative filtering, dataset quality, global correlation, user Preprint submitted to Knowledge-Based Systems

February 13, 2018

ACCEPTED MANUSCRIPT

similarity, item similarity

CR IP T

1. Introduction Recommender systems are based on predictions, which rely on datasets

containing known values. In collaborative filtering (CF), datasets contain known ratings corresponding to user-item pairs. CF models are evaluated

on various datasets, and results of a single model across different datasets

AN US

are highly variable, since datasets typically exhibit different properties and yield different prediction accuracies. This introduces the need to measure the quality of a dataset which would be an indication of its prediction difficulty. We can use this measure as a reference point when proposing and evaluating

M

a prediction model on a certain dataset. Developing a prediction model is expensive and time consuming, and quick heuristic indicators of potential

ED

prediction accuracy on a dataset can be very valuable; for example, when deciding which additional data to take into account. For instance, an e-

PT

commerce provider can use this approach when designing the CF process. Specifically, the item vector can be structured in a way to take into account

CE

newly collected data, e.g. in case a new type of rating is introduced. The quality measure can be used to evaluate how the new dataset would behave

AC

if collaborative filtering was to be extended to these new attributes. In effect, the measure would estimate the level of maturity that the available collected data would give to the newly designed CF process. In other words, knowing the proposed measure of a dataset could help us predict whether 2

ACCEPTED MANUSCRIPT

high prediction accuracy can be expected. Other potential uses can be found for educational and scientific purposes.

CR IP T

It is often the case when examining a new approach that the required specific dataset is not available. In such cases, the researchers often rely on synthetically generated data. The proposed measure could be used to assess if the generated dataset resembles a realistic CF dataset (is it suitable for

making predictions) or its values behave as if they were random, and if so,

the artificial dataset generators.

AN US

to which degree. Consequently, the measure can be used to further fine-tune

To make predictions, all CF models explicitly or implicitly rely on similarity between users and/or similarity between items. Prediction accuracy,

M

and therefore the reliability of a recommender system, depends on the degree of correlation of datasets’ values, which is opposed to randomness and

ED

noise. To the best of our knowledge, there are not many global correlationbased measures of a dataset quality, and existing ones include calculation

PT

of many user-user or item-item similarities, which is not more efficient then

CE

calculating all predictions. Our work presents the following contributions: • We define two novel heuristic measures of a CF dataset quality, de-

AC

signed to quickly estimate the amount of global user correlation and item correlation.

• Using comprehensive experiments on synthetic and real-world data, we show that the proposed measures satisfy several desirable properties

3

ACCEPTED MANUSCRIPT

such as correlating with the the accuracy of standard prediction models.

CR IP T

• We derive several applicable principles from the results. The rest of the paper is structured as follows. Related work is discussed in Sect. 2. Sect. 3 gives the formal preliminaries and defines the desired properties of the proposed measures. Sect. 4 gives precise definitions and

the algorithm for calculating the measures. Sect. 5 and Sect. 6 describe the

AN US

experimental results on artificial and real-world datasets. Conclusions are given in Sect. 7. 2. Related work

M

Data quality has multiple aspects, discussed in e.g. [1] and [2]. Natural noise, based on user inconsistency, was discussed in [3], [4], and showed

ED

to constrain the prediction ability beyond some maximal (”magic barrier”) value [5]. Another aspect is missing data, which is related to dataset spar-

PT

sity, specifically discussed in [6]. However, sparsity is information about the amount of available data, not about its properties; for example, it does

CE

not capture whether there are many (or any) users with similar preferences. Marlin et al. [7] have analyzed which data is missing, showing and utilizing

AC

non-randomness of the missing data distribution. User-user and item-item correlations are usually calculated using Pear-

son Correlation (PCC) and less often using vector cosine, Spearman rank or Kendall’s τ correlation [8]. However, these measures are defined on the 4

ACCEPTED MANUSCRIPT

level on individual users/items only. Computing all pairwise similarities to count the number of similar pairs (e.g., with P CC > 0.5) is infeasible for

CR IP T

large datasets. Recently, [9] measured sparsity and redundancy for setting the sampling levels in a CF model. Redundancy was defined using the number of users, items, and ratings, as well as the average pairwise similarity of

users having at least one jointly rated item. As it is the case for the number of similar pairs, this measure is computationally demanding as it poten-

AN US

tially calculates O(N 2 ) user-user similarities, each of which takes at least O(ratings per user) time, which gives a very impractical time complexity for datasets with e.g. 105 users.

Our approach attempts to obtain a related global measure more efficiently.

M

It is based on user-user and item-item similarities, assuming that in a ”highquality” dataset there are groups of highly correlated users (we can predict

ED

the behavior of one based on the others) and groups of highly correlated items (we can predict the ratings of one based on the others). As these are,

PT

in general, two distinct properties, we are actually proposing two measures: one for the user correlation degree and one for the item correlation degree. As

CE

a consequence, we can compare the two measures when deciding between the user-based and the item-based prediction model, as their prediction ability

AC

can differ. This is useful only if the measures can be computed quickly, which is the case for the proposed measures.

5

ACCEPTED MANUSCRIPT

3. Required properties of the measure Let ru,i denote the rating of user u on item i. Let r¯u denote the average

CR IP T

rating of user u and let r¯i denote the average rating of item i. The similarity of users u, v is usually calculated as a Pearson Correlation Coefficient (PCC): P

− r¯u )(rv,i − r¯v ) qP , 2 2 (r − r ¯ ) (r − r ¯ ) u,i u v,i v i∈Iu,v i∈Iu,v

simu,v = qP

i∈Iu,v (ru,i

(1)

AN US

where Iu,v is the set of items rated by both users u and v. Analogously, the item similarity is calculated as

− r¯i )(ru,j − r¯j ) qP , 2 2 (r − r ¯ ) (r − r ¯ ) u,i i u,j j u∈Ui,j u∈Ui,j u∈Ui,j (ru,i

(2)

M

simi,j = qP

P

where Ui,j is the set of users who have rated both items i and j.

ED

The standard CF approaches based on PCC are the user-based (UPCC) and item-based (IPCC) predictions [10]. To predict the rating ru,i , the user-

PT

based approach searches for the positive similarities between the current user u and other users v who have rated item i. The prediction is then calculated

AC

CE

using the weighted average of their ratings:

UPCC u,i = r¯u +

P

v

simu,v (rv,i − r¯v ) P . v simu,v

(3)

The item-based prediction works analogously, taking into account the

current user’s ratings for items j with positive similarity to the current item i: 6

ACCEPTED MANUSCRIPT

IPCC u,i = r¯i +

P

j

simi,j (ru,j − r¯j ) P . j simi,j

(4)

CR IP T

It is clear that high similarities among users and items imply better prediction ability, not only when using PCC-based CF approaches. We therefore attempt to estimate the dataset quality using real numbers GUCM (Global

User Correlation Measure) and GICM (Global Item Correlation Measure)

AN US

with the following properties:

• GUCM, GICM∈ [0, 1] with a low value corresponding to low similarities and a high value corresponding to high similarities.

• GUCM and GICM should correlate with prediction accuracy, i.e., higher

M

GUCM/GICM should imply higher Top-N precision rates.

ED

• GUCM and GICM should be close to 0 on artificial random datasets (with random real-valued ratings).

PT

• GUCM should be close to 1 on a dataset where all users are like-minded. Analogously for GICM and items.

CE

• GUCM should negatively correlate with the number of natural clusters of similar users, i.e., with the variability of user types. Analogously for

AC

GICM and items.

The following section gives definitions of GUCM and GICM with the

algorithm for their calculation. The properties described above are confirmed in the experimental results, Sect. 5 and 6. 7

ACCEPTED MANUSCRIPT

4. Defining and calculating the measures As the definitions and the calculation algorithms for GUCM and GICM

CR IP T

are completely analogous (symmetrical), we will describe GUCM in detail; GICM is then easily defined and calculated by switching the roles of ”users” and ”items” (for example, by transposing the user-item matrix).

The basic idea is that if there are groups of like-minded users, this will

AN US

reflect in their ratings of a single item: the ratings of this item will not be uniformly distributed, but will be divisible into groups of mutually close

ratings. We will therefore focus on the ratings distribution at item level and estimate its divergence from the uniform distribution.

M

Namely, looking at a single item i, we estimate the degree of users’ agreement on the item rating as a real number in [0, 1] as follows. We first compute

ED

the number of close pairs, denoted by CPi , which is the number of user pairs whose normalized ratings (with the average subtracted) differ by less than a

PT

certain threshold T :

(5)

CE

CPi = |{{u, v} ⊆ Ui : |(ru,i − r¯u ) − (rv,i − r¯v )| < T }|,

where Ui is the set of all users who have rated item i. If their number is

AC

|Ui | = Ki , the percentage of close pairs, denoted by pi , equals the number of

close pairs divided by the number of all considered pairs:

pi =

CPi . Ki (Ki − 1)/2 8

(6)

ACCEPTED MANUSCRIPT

This represents the probability that a random pair of users have close ratings on item i. Since some close ratings naturally appear in random datasets as

CR IP T

well, and since the threshold value T has a direct impact on pi , we will ”correct” pi by subtracting the expected value of the same probability for

the random distribution, i.e., the probability that two uniformly distributed

random ratings from the same range differ by less than T . This probability equals T (2D − T ) , D2

AN US

q=

(7)

where D is the given range, i.e., the difference between the maximum and the minimum value in the dataset: D = rmax − rmin . The proof of (7) is

M

given in Appendix A.

We therefore define the user agreement degree on item i, denoted by

ED

U ADi , as pi − q, i.e., the difference between the percentage of close pairs and the expected percentage for the random dataset. In the unlikely case that

CE

PT

pi − q < 0, we set the value to zero: U ADi = max{pi − q, 0} ∈ [0, 1].

(8)

Finally, GUCM is defined as the weighted average of user agreements across

AC

items, with the weight wi equal to Ki (the amount of data used in calculating U ADi ): GU CM =

PM

Ki · U ADi ∈ [0, 1], PM i=1 Ki

i=1

9

(9)

ACCEPTED MANUSCRIPT

where M is the total number of items. We propose to set the aforementioned ”closeness” threshold Tuser to be the standard deviation of the

CR IP T

user averages r¯u , because a small standard deviation implies similar subjective scales and, therefore, smaller differences between close pairs. We have

also tried other threshold values and found that the proposed choice gives meaningful results in practice.

AN US

4.1. The algorithm

In order to compute CPi efficiently, we first sort the known ratings of item i as an increasing sequence a1 , . . . , aKi . We want to find the number of pairs (k, l) such that k < l and al − ak < T . For a fixed k, we find

M

the largest possible l still within the threshold, i.e., the largest al less than ak + T , and then the number of pairs for this fixed k equals l − k because

ED

all pairs (k, k + 1), (k, k + 2), . . . , (k, l) are close. When moving to the next k (i.e., increasing k by one), we update l, but it obviously increases or stays

PT

the same. This means that, after initial sorting in O(Ki log Ki ), the time complexity of calculating CPi is O(Ki ) as we only move the indices k and l

CE

from the beginning to the end of the sorted ratings sequence. The complete GUCM algorithm is given in Alg. 1 and computing GICM 1

AC

is entirely symmetrical (users ←→ items).

Since calculations of U ADi are mutually independent (as well as calcula-

tions of r¯u ), they can be computed concurrently. 1

Implementation can be found on https://bitbucket.org/satja/gcm/src.

10

ACCEPTED MANUSCRIPT

ED

M

AN US

CR IP T

Algorithm 1: The algorithm for calculating GUCM Data: User-item ratings ru,i (u ∈ {1, . . . , N }, i ∈ {1, . . . , M }) D := rmax − rmin for u = 1 to N independently do r¯u := average{ru,i : u has rated i} end T := standardDeviation{¯ ru : u = 1, . . . , N } 2 q := T (2D − T )/D for i = 1 to M independently do Ki := |{u : u has rated i}| (a1 , a2 , . . . , aKi ) := sorted({ru,i : u has rated i}) CPi := 0 l := 1 for k = 1 to Ki do while l < Ki and al+1 − ak < T do l := l + 1 end CPi := CPi + l − k end pi := CPi /(Ki (Ki − 1)/2) U ADi := max{pi − q, 0} end PM P Return GU CM := ( M i=1 Ki ) i=1 Ki · U ADi )/(

PT

5. Results on synthetic data

CE

5.1. Random categorical sets In this experiment, we have randomly generated several categorical user-

AC

item matrices of size 5000 × 5000 with rating values in {0, 1, . . . , K − 1}, where K is the number of categories (possible rating values). The results are given in Fig. 1. We can see that a higher number of categories results in a lower measure, which is a desired property because increasing the number of possible rating values increases the randomness (”entropy”) in the dataset, 11

ACCEPTED MANUSCRIPT

making the users (or items) less likely to be similar. Also, as expected, the measure approaches the value of 0 for highly random datasets.

CR IP T

0.25

GUCM GICM

0.20 0.15

0.05 0.00

2

4

AN US

0.10

8 16 Number of categories

32

M

Figure 1: Random categorical sets, N = M = 5000

ED

5.2. Artificial sets with natural clusters To show that the measure is roughly proportional to the amount of agree-

PT

ment in the dataset, we have created several complete 5000 × 5000 useritem matrices and filled them with random integer values from [0, 10000]

CE

so that there are K clusters of equal users, where K was varied (K = 1, 2, . . . , 10, 20, . . . , 90, 100). The results are given in Fig. 2. As expected,

AC

GU CM ≈ 1 when all users agree, and GUCM decreases as the number of nat-

ural clusters (user variability) increases; moreover, GU CM ≈ 1/K. Since we only clustered the users, GICM is lower than GUCM as expected. High correlation between users implies some correlation between items, which explains why GICM is also negatively correlated with the number of user clusters. 12

ACCEPTED MANUSCRIPT

1.0

GUCM GICM

CR IP T

0.8 0.6 0.4

0.0

AN US

0.2

1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100 Number of clusters of equal users Figure 2: Artificial sets with natural user clusters, N = M = 5000

M

6. Results on real-world data

ED

For each of the 13 considered real-world datasets (see Table 1), we have randomly divided the data into training and testing sets (ratio 90:10) and the

PT

GUCM/GICM measures for each dataset were computed using the training set only. We have compared these measures with two other, computationally

CE

much more demanding measures. The first was the percentage of pairs with (PCC) similarity higher than 0.5, called similar user pairs and similar item

AC

pairs. The second was the average similarity of the pairs which have at least one common2 rating (this measure is an ingredient of Redundancy measure3 2

i.e., different by at most 10% of the rating scale. The Redundancy measure itself, having other ingredients besides similarity and being made for a different purpose, did not correlate with the proposed measures. 3

13

ACCEPTED MANUSCRIPT

items 49 2071 16121 139738 48794 1682 3706 150 17770 9927 11916 340541 29564

ratings 2450 35497 72665 664824 8196077 100000 1000209 1728830 10375459 6337241 211231 1149763 656276

range [0.0, 1.0] 0.5, 1, . . . , 4 1, 2, 3, 4 1, 2, 3, 4 0.5, 1, . . . , 5 1, 2, . . . , 5 1, 2, . . . , 5 [-10.0, 10.0] 1, 2, . . . , 5 1, 2, . . . , 10 1, 2, . . . , 13 0, 1, . . . , 10 0, 1, . . . , 10

reference [11] [12] [13] [14]

CR IP T

users 50 1508 17615 40163 147612 943 6040 50691 50000 69600 7642 105281 51503

AN US

Dataset Amazon QoSa FilmTrustb Ciaoc Epinionsd Flixtere MovieLens 100Kf MovieLens 1Mg Jesterh Netflixi Animej YahooMoviesk BookCrossingl MovieTweetingsm

[15]

[16] [17]

Table 1: Real-world datasets description a

ED

M

http://ccl.fer.hr/people/research-assistants/marin-silic/clus-evaluation-dataset b http://www.librec.net/datasets/filmtrust.zip c http://www.librec.net/datasets/CiaoDVD.zip d http://www.trustlet.org/downloaded epinions.html e http://www.cs.ubc.ca/ jamalim/datasets f http://grouplens.org/datasets/movielens/100k g http://grouplens.org/datasets/movielens/1m h http://eigentaste.berkeley.edu/dataset i http://www.netflixprize.com j https://www.kaggle.com/CooperUnion/anime-recommendations-database k https://webscope.sandbox.yahoo.com l http://www2.informatik.uni-freiburg.de/∼cziegler/BX/ m https://github.com/sidooms/MovieTweetings

[9]) which we called user similarity degree and item similarity degree.

PT

We have also implemented standard user-based (UPCC) and item-based (IPCC) collaborative filtering approaches based on similar users/items as de-

CE

scribed in Sect. 3, using the training set data to predict the values in the testing set, as well as the predictions based on Matrix Factorization (MF)

AC

algorithm using stochastic gradient descent [18], where the hyperparameters were learned from the training set using cross-validation. The accuracy was defined using relevance of the items with highest (Top 10) ratings for a given user; namely, we used Normalized Discounted Cumulative Gain

14

ACCEPTED MANUSCRIPT

(nDCG) measure [19, 20]. GUCM 0.078 0.068 0.068 0.109 0.162 0.142 0.178 0.159 0.223 0.267 0.250 0.377 0.684

GICM 0.109 0.121 0.124 0.115 0.141 0.213 0.212 0.235 0.255 0.287 0.342 0.400 0.650

sim. user pairs 0.155 0.201 0.179 0.001 0.146 0.147 0.012 0.307 0.045 0.001 0.206 0.019 1.000

sim. item pairs 0.285 0.168 0.211 0.002 0.031 0.028 0.001 0.029 0.282 0.005 0.208 0.015 0.989

user sim. deg. 0.161 0.215 0.171 0.513 0.144 0.113 0.647 0.435 0.161 0.268 0.266 0.299 0.988

item sim. deg. 0.469 0.199 0.276 0.328 0.347 0.246 0.296 0.300 0.614 0.226 0.342 0.268 0.877

UPCC nDCG 0.902 0.909 0.914 0.882 0.955 0.958 0.968 0.971 0.943 0.988 0.961 0.983 0.999

IPCC nDCG 0.904 0.868 0.920 0.950 0.941 0.979 0.980 0.971 0.971 0.983 0.947 0.976 0.993

MF nDCG 0.889 0.910 0.905 0.878 0.948 0.955 0.969 0.968 0.940 0.987 0.957 0.981 0.999

CR IP T

dataset Netflix MovieLens1M MovieLens100K BookCrossing Jester FilmTrust Epinions YahooMovies Flixter CiaoDVD Anime MovieTweetings AmazonQoS

AN US

Table 2: Numerical results for the real-world datasets

proposed metric

other metric

Pearson corr. coef.

p-value

user-based item-based

GUCM GICM

similar user pairs similar item pairs

0.705 0.679

0.007 0.011

user-based item-based

GUCM GICM

user similarity degree item similarity degree

0.689 0.633

0.009 0.020

user-based item-based general

GUCM GICM (GUCM+GICM)/2

UPCC Top-10 nDCG IPCC Top-10 nDCG MF Top-10 nDCG

0.728 0.622 0.751

0.005 0.023 0.003

M

perspective

ED

Table 3: Correlations between GUCM/GICM and other prediction ability indicators across 13 real-world datasets

PT

Numerical results are given in Table 2 and the correlations between the proposed measures and other prediction ability indicators (i.e., between

CE

columns of Table 2) are given in Table 3. For each row in Table 3, Pearson correlation coefficient (in range [−1, 1]) was calculated for two 13-element

AC

vectors, each containing the values of a given metric for all datasets. p-value is the probability that the obtained correlation coefficients were accidental. Relatively high coefficients (0.62 − 0.75) and low p-values confirm the relation between the proposed measures and other prediction ability indicators, showing that higher GUCM and GICM tend to imply more similarities and 15

ACCEPTED MANUSCRIPT

1.00

0.6

0.98

0.5

0.96

0.4

0.94

0.3

CR IP T

GUCM

UPCC nDCG

0.7

0.92

0.2

0.90

0.1

0.88

0.0 t r r ens1M ns100KNetflix ookCrossinFgilmTrusYahooMovies Jeste Epinions Flixte B MovieL MovieLe

0.86 Anime CiaoDVD ieTweetinmgas zon QoS A v o M

0.7 0.6 0.5

GICM

0.4 0.3

M

0.2 0.1

0.98 0.96 0.94 0.92 0.90 0.88

0.86 Netflix ookCrossinogvieLens1MvieLens100KJester Epinions FilmTrustahooMovies Flixter CiaoDVD Anime ieTweetinmgas zon QoS A Y v B M Mo Mo

ED

0.0

1.00

IPCC nDCG

AN US

Figure 3: GUCM (black) and user-based Top-10 precison (red)

Figure 4: GICM (black) and item-based Top-10 precision (red) 1.00

PT

0.7

(GUCM + GICM) / 2

0.5

0.96 0.94

CE

0.4

0.98

MF nDCG

0.6

0.92

0.2

0.90

0.1

0.88

AC

0.3

0.0

Netflix ovieLens1MvieLens10o0oKkCrossing Jester FilmTrust Epinions ahooMovies Flixter CiaoDVD Anime ieTweetinmgas zon QoS A Y B M Mo Mov

Figure 5: (GUCM+GICM)/2 (black) and MF-based Top-10 precision (red)

16

0.86

ACCEPTED MANUSCRIPT

greater prediction ability. metric

prediction accuracy

Pearson corr. coef.

p-value

user-based item-based general

similar user pairs similar item pairs (sim. user pairs + sim. item pairs)/2

UPCC Top-10 nDCG IPCC Top-10 nDCG MF Top-10 nDCG

0.356 0.088 0.285

0.233 0.775 0.345

user-based item-based general

user similarity degree item similarity degree (user sim. deg. + item sim. deg.)/2

UPCC Top-10 nDCG IPCC Top-10 nDCG MF Top-10 nDCG

0.371 0.299 0.360

0.212 0.320 0.227

CR IP T

perspective

Table 4: Correlations between existing (less efficient) measures and prediction accuracies across 13 real-world datasets

AN US

For the sake of comparison, Table 4 presents the results for the alterna-

tive measures similar user/item pairs and user/item similarity degree. As we can see, the correlations between these measures and the prediction accuracies, although expectedly positive, are weaker and less significant than

M

the same correlations for GUCM/GICM measures. This means that GUCM and GICM are better estimators of prediction accuracies, apart from being

ED

computationally much more efficient. Relations between the proposed measures and the precision of prediction

PT

algorithms are illustrated in Fig. 3 (GUCM vs. user-based predictions), Fig. 4 (GICM vs. item-based predictions), and Fig. 5 (average of GUCM and

CE

GICM vs. MF-based predictions). In each figure, black columns represent the proposed measures (scale on the left y-axis), while red columns repre-

AC

sent the Top-10 nDCG precision of the prediction algorithms (scale on the right y-axis). The figures show that usually, when GUCM and GICM are higher, Top-10 predictions are better. There are anomalies since the proposed measures are not more than heuristics, but the following observation

17

ACCEPTED MANUSCRIPT

can be empirically derived from these results: a proposed measure lower than 0.13 usually implies the corresponding Top-10 precision lower than 94%

CR IP T

(supported in 11/12 cases), while a proposed measure higher than 0.14 usually implies the corresponding Top-10 precision of at least 94% (supported

in 27/27 cases). This is illustrated in the figures: the vertical line separates GCM measures lower and higher than 0.135, while the horizontal line

AN US

separates precisions lower and higher than 94%. 7. Conclusion and further study

The user-based and item-based Global Correlation Measures (GCM) have been empirically shown to satisfy several desirable requirements, such as to

M

correlate with the amount of user-user and item-item similarities, as well as with the prediction accuracy, to approach a value of 1 on datasets with equal

ED

users and a value of 0 on datasets with random ratings, and to negatively correlate with the number of natural clusters of similar users/items. Since

PT

the proposed measures can be computed efficiently, in time proportional to

CE

the number of known ratings, they can be useful when deciding whether to apply CF on a given dataset – the above results suggest that a GCM value

AC

higher than 0.14 tends to imply a high prediction accuracy. In case one needs to choose whether to include some additional or questionable records in a dataset, GCM can be used to estimate if the dataset quality would increase or decrease with a considered change. GCM can also be used to choose or differentiate between multiple datasets in various educational and 18

ACCEPTED MANUSCRIPT

research purposes, such as evaluating a new CF model. In further study, the proposed measures could be refined to take into account semantic domain

Acknowledgment

CR IP T

knowledge [21], fuzziness [22], and multirelational networks [23].

The authors acknowledge the support of the Croatian Science Foundation through the Recommender System for Service-oriented Architecture

AN US

(IP-2014-09-9606) research project. References

M

[1] R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers, J. Manage. Inf. Syst. 12 (4) (1996) 5–33. doi:10.1080/07421222.1996.11518099. URL http://dx.doi.org/10.1080/07421222.1996.11518099

PT

ED

[2] L. L. Pipino, Y. W. Lee, R. Y. Wang, Data quality assessment, Commun. ACM 45 (4) (2002) 211–218. doi:10.1145/505248.506010. URL http://doi.acm.org/10.1145/505248.506010

AC

CE

[3] X. Amatriain, J. M. Pujol, N. Tintarev, N. Oliver, Rate it again: Increasing recommendation accuracy by user re-rating, in: Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, ACM, New York, NY, USA, 2009, pp. 173–180. doi:10.1145/1639714.1639744. URL http://doi.acm.org/10.1145/1639714.1639744 [4] X. Amatriain, J. M. Pujol, N. Oliver, I Like It... I Like It Not: Evaluating User Ratings Noise in Recommender Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 247–258. doi:10.1007/978-3642-02247-0 24. URL http://dx.doi.org/10.1007/978-3-642-02247-0\_24 19

ACCEPTED MANUSCRIPT

CR IP T

[5] A. Said, B. J. Jain, S. Narr, T. Plumbaum, Users and Noise: The Magic Barrier of Recommender Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 237–248. doi:10.1007/978-3-642-31454-4 20. URL http://dx.doi.org/10.1007/978-3-642-31454-4\_20

[6] M. Grˇcar, D. Mladeniˇc, B. Fortuna, M. Grobelnik, Data Sparsity Issues in the Collaborative Filtering Framework, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 58–76. doi:10.1007/11891321 4. URL http://dx.doi.org/10.1007/11891321\_4

M

AN US

[7] B. M. Marlin, R. S. Zemel, S. Roweis, M. Slaney, Collaborative filtering and the missing at random assumption, in: Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI’07, AUAI Press, Arlington, Virginia, United States, 2007, pp. 267– 275. URL http://dl.acm.org/citation.cfm?id=3020488.3020521

ED

[8] X. Su, T. M. Khoshgoftaar, A survey of collaborative filtering techniques, Adv. in Artif. Intell. 2009 (2009) 4:2–4:2. doi:10.1155/2009/421425. URL http://dx.doi.org/10.1155/2009/421425

AC

CE

PT

[9] O. Sar Shalom, S. Berkovsky, R. Ronen, E. Ziklik, A. Amihood, Data quality matters in recommender systems, in: Proceedings of the 9th ACM Conference on Recommender Systems, RecSys ’15, ACM, New York, NY, USA, 2015, pp. 257–260. doi:10.1145/2792838.2799670. URL http://doi.acm.org/10.1145/2792838.2799670

[10] J. S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI’98, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998, pp. 43–52. URL http://dl.acm.org/citation.cfm?id=2074094.2074100 20

ACCEPTED MANUSCRIPT

CR IP T

[11] M. Silic, G. Delac, I. Krka, S. Srbljic, Scalable and accurate prediction of availability of atomic web services, Services Computing, IEEE Transactions on 7 (2) (2014) 252–264. doi:10.1109/TSC.2013.3. [12] G. Guo, J. Zhang, N. Yorke-Smith, A novel bayesian similarity measure for recommender systems, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013, pp. 2619–2625.

AN US

[13] G. Guo, J. Zhang, D. Thalmann, N. Yorke-Smith, Etaf: An extended trust antecedents framework for trust prediction, in: Proceedings of the 2014 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2014, pp. 540–547.

M

[14] P. Massa, P. Avesani, Trust-aware recommender systems, in: Proceedings of the 2007 ACM conference on Recommender systems, 2007, pp. 17–24.

ED

[15] K. Goldberg, T. Roeder, D. Gupta, C. Perkins, Eigentaste: A constant time collaborative filtering algorithm, Information Retrieval 4 (2) (2001) 133–151. doi:10.1023/A:1011419012209. URL http://dx.doi.org/10.1023/A:1011419012209

CE

PT

[16] C.-N. Ziegler, S. M. McNee, J. A. Konstan, G. Lausen, Improving recommendation lists through topic diversification, in: Proceedings of the 14th International Conference on World Wide Web, WWW ’05, ACM, New York, NY, USA, 2005, pp. 22–32. doi:10.1145/1060745.1060754. URL http://doi.acm.org/10.1145/1060745.1060754

AC

[17] S. Dooms, T. De Pessemier, L. Martens, Movietweetings: a movie rating dataset collected from twitter, in: Workshop on Crowdsourcing and Human Computation for Recommender Systems, held in conjunction with the 7th ACM Conference on Recommender Systems, 2013, p. 2.

21

ACCEPTED MANUSCRIPT

CR IP T

[18] G. Tak´acs, I. Pil´aszy, B. N´emeth, D. Tikk, Scalable collaborative filtering approaches for large recommender systems, J. Mach. Learn. Res. 10 (2009) 623–656. URL http://dl.acm.org/citation.cfm?id=1577069.1577091 [19] K. J¨arvelin, J. Kek¨al¨ainen, Cumulated gain-based evaluation of ir techniques, ACM Trans. Inf. Syst. 20 (4) (2002) 422–446. doi:10.1145/582415.582418. URL http://doi.acm.org/10.1145/582415.582418

M

AN US

[20] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in: Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, ACM, New York, NY, USA, 2005, pp. 89–96. doi:10.1145/1102351.1102363. URL http://doi.acm.org/10.1145/1102351.1102363

PT

ED

[21] Q. Shambour, J. Lu, A trust-semantic fusion-based recommendation approach for e-business applications, Decision Support Systems 54 (1) (2012) 768 – 780. doi:https://doi.org/10.1016/j.dss.2012.09.005. URL http://www.sciencedirect.com/science/article/pii/ S0167923612002400

AC

CE

[22] Z. Zhang, H. Lin, K. Liu, D. Wu, G. Zhang, J. Lu, A hybrid fuzzy-based personalized recommender system for telecom products/services, Information Sciences 235 (2013) 117 – 129, data-based Control, Decision, Scheduling and Fault Diagnostics. doi:https://doi.org/10.1016/j.ins.2013.01.025. URL http://www.sciencedirect.com/science/article/pii/ S0020025513000820 [23] M. Mao, J. Lu, G. Zhang, J. Zhang, Multirelational social recommenda-

22

ACCEPTED MANUSCRIPT

tions via multigraph ranking, IEEE Transactions on Cybernetics 47 (12) (2017) 4049–4061. doi:10.1109/TCYB.2016.2595620.

CR IP T

Appendix A. Proof of Eq. (7)

Here we prove Eq. (7) which gives the probability that two random values from the uniform distribution on the interval of length D differ by less than T . If we rescale the interval to become [0, 1], the threshold difference should

AN US

also be rescaled by factor 1/D, i.e., we are looking for the probability that the difference between two random values in [0, 1] is less than T /D.

More precisely, let us assume that X and Y are independent uniformly distributed random variables from [0, 1]. We are looking for the probability

M

density function of their difference X − Y ∈ [−1, 1], which is given by the convolution of the corresponding separate density functions:

ED

Z

PT

fX−Y (x) = (fX ∗ f−Y ) (x) = fX (y)f−Y (x − y) dy Z Z Z 1 1[−1,0] (x − y) dy = = 1[0,1] (y)1[−1,0] (x − y) dy = 0

x

1[−1,0] (t) dt

x−1

CE

= min{x, 0} − max{x − 1, −1} = min{x, 0} − (max{x, 0} − 1)

AC

= 1 − |x|.

Therefore, Pr[−T /D ≤ X − Y ≤ T /D] equals Z

T /D

−T /D

(1 − |t|) dt = 2

Z

0

T /D

 T /D    T (2D − T ) x2 T T2 = (1 − t) dt = 2 x − =2 − , 2 D 2D2 D2 0 23

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

which proves Eq. (7).

24