International Journal of Information Management xxx (xxxx) xxxx
Contents lists available at ScienceDirect
International Journal of Information Management journal homepage: www.elsevier.com/locate/ijinfomgt
Predicting semantic preferences in a socio-semantic system with collaborative filtering: A case study
Jean-François Chartier a,*, Pierre Mongeau b, Johanne Saint-Charles b
a Université du Québec à Montréal, Bureau des Initiatives Numériques (BIN), Centre Interuniversitaire de Recherche sur la Science et la Technologie, Pavillon Paul-Gérin-Lajoie, 8e étage, 1205, Rue Saint-Denis, Montréal, Québec, H2X 3R9, Canada
b Université du Québec à Montréal, Faculté de Communication, Département de Communication Sociale et Publique, Pavillon Judith-Jasmin, 3e étage, 405, Rue Sainte-Catherine Est, Montréal, Québec, H2L 2C4, Canada
ARTICLE INFO
Keywords: Collaborative filtering; Socio-semantic system; Links prediction; Semantic preferences; Social networks; Semantic networks
ABSTRACT
This paper proposes collaborative filtering as a means to predict semantic preferences by combining information on social ties with information on links between actors and semantics. First, the authors present an overview of the most relevant collaborative filtering approaches, showing how they work and how they differ. They then compare three different collaborative filtering algorithms using articles published by New York Times journalists from 2003 to 2005 to predict preferences, where preferences refer to journalists’ inclination to use certain words in their writing. Results show that while preference profile similarities in an actor’s neighbourhood are a good predictor of her semantic preferences, information on her social network adds little to prediction accuracy.
* Corresponding author (J.-F. Chartier).
https://doi.org/10.1016/j.ijinfomgt.2019.10.005
Received 21 January 2019; Received in revised form 17 September 2019; Accepted 3 October 2019
0268-4012/ © 2019 Elsevier Ltd. All rights reserved.

1. Introduction
In the context of massive digitalisation of human activities and access to large-scale data sets containing information about human behaviours and social phenomena, computational models have made it possible to model some social phenomena, such as link creation, dissolution or change, with predictive accuracy. Being able to predict changes in connections or in opinions, or being able to complete missing information about customers’ opinions, for example, offers many advantages to organisations wishing to better understand their ecosystem. In that perspective, the reliability and accuracy of such predictions are of utmost importance for information management.

The challenge here is to make accurate predictions, given the complexity of socio-semantic systems combining social networks (social relationships between individual agents) and links between agents and the words they use (Chartier, 2016; Roth, 2005; Roth & Cointet, 2010). “Words” here can be seen as a proxy for ideas and opinions expressed by agents.

As an answer to this challenge, this paper proposes to test different models of collaborative filtering algorithms to predict unknown preferences in a socio-semantic system. Collaborative filtering is a widely implemented recommendation system that allows preference predictions (filtering) based on the choices of many other similar (collaborating) agents (Ricci, Rokach, & Shapira, 2011). After presenting the models selected for testing, we proceed with three experiments using a data set of newspaper articles from The New York Times. To our knowledge, no study has yet explored the use of collaborative filtering for link prediction in the context of a socio-semantic system.

The findings of this study are important for social sciences in which the use of computational models, algorithmic methods and big data is becoming more and more widespread, raising new epistemological questions about the predictability of social phenomena. Several computational social scientists have recently emphasised that the lack of predictive accuracy of models in the social sciences, in comparison with other scientific domains, is something that must be addressed and challenged (Hofman, Sharma, & Watts, 2017; Martin, Hofman, Sharma, Anderson, & Watts, 2016). This paper suggests that collaborative filtering algorithms may be a promising line of research for computational social sciences.

2. The preference prediction problem in networks
Preference prediction in networks has spurred the interest of scholars in recent years as predictions have proven valuable in many fields such as online social networks, biological networks, organisational networks, transportation networks and technological networks (Aggarwal, 2015; Al Hasan & Zaki, 2011; Han & Kamber, 2006; Liben-Nowell & Kleinberg, 2007; Lichtenwalter, Lussier, & Chawla, 2010; Lü & Zhou, 2011; Martínez, Berzal, & Cubero, 2017; Rattigan & Jensen, 2005). Broadly, preference prediction algorithms aim either to predict the future or to infer the unknown. For example, numerous recommendation systems (such as Facebook friend suggestions) attempt to predict which links will be created, which will remain and which will disappear, and, more generally, how those links will be transformed (through changes in type, direction or strength) at time T1, based on users’ attributes and links at T0. On the other hand, link prediction methods can be used when information about some of the users in the network is unknown, thus making it possible to carry out operations of “matrix completion.”

Regardless of their area of application, predictive models tend to be similar. The main predictive models rely on similarity: the greater the similarity between two individuals according to some criteria, the greater the similarity they will display according to others. For example, the probability that unconnected individuals at T0 will be connected at T1 will depend on how similar they are at T0 (Liben-Nowell & Kleinberg, 2007). Similarity may be defined in various ways. For instance, with topological similarity, two individuals are deemed similar if they share connections (word use) with the same people (high neighbourhood similarity); in social network terminology, this refers to structural equivalence (Borgatti & Everett, 1992; Breiger, Boorman, & Arabie, 1975; Lorrain & White, 1971).
2.1. Preference prediction in socio-semantic systems
Link prediction studies are mostly concerned with “classic” (one-mode) social networks in which there are nodes (agents) that are linked together and have attributes. Socio-semantic systems, on the other hand, combine two kinds of networks: an “agents × agents” network and an “agents × words” network, the latter being a bimodal network in which there are links between different types of nodes: agents and words. Among predictive models, collaborative filtering algorithms have been used for bimodal networks and therefore constitute relevant models to use with socio-semantic systems. Our study involves a combination of several models from collaborative filtering, machine learning, social network analysis and natural language processing. In what follows, we first present the computational models we have retained to study socio-semantic systems. This section is then followed by a presentation of the case study in which our experiments were carried out.

3. Computational models
Collaborative filtering algorithms are the basis of a predictive computational model that tries to capture different signals from social coordination and interactions. The basic assumption of this model is that preferences, and eventually decisions, from one individual can be predicted by looking at the preferences of other individuals who have similar profiles and have usually made similar decisions (Aggarwal, 2016; Herlocker, Konstan, Borchers, & Riedl, 1999; Su & Khoshgoftaar, 2009). In other words, one can predict or infer links between agents and concepts. Predictions might be based on a bipartite network, on an immediate neighbourhood-based process, or on social signals added as a new and independent input.

3.1. Modelling collaborative filtering with a bipartite network
More formally, the model in its simplest form can be described as an agent-object undirected weighted bipartite network $G_{oa} = (A, O, P_{oa})$, where $A = \{a_1, \ldots, a_n\}$ denotes a set of $n$ agents and $O = \{o_1, \ldots, o_m\}$ denotes a set of $m$ objects. $P_{oa} = [p^{a_i}_{o_j}] \in \mathbb{R}^{n \times m}$ denotes the preference matrix of agents $A$ for objects $O$. Since collaborative filtering algorithms are usually studied in the context of recommender systems, agents are usually referred to as users of a certain platform such as Netflix, Facebook, and the like, and objects are items such as films, friends, books, etc. The preference matrix is induced from past decisions or explicit ratings made by agents. For a Netflix user, for instance, a preference could be defined by the number of times she watched a specific film in the past or her ratings of previously seen films. Note that collaborative filtering is not limited to such applications, and many other kinds of preferences involving different kinds of objects can be predicted with this framework.

The preference matrix $P_{oa}$ tends to be sparse since we do not usually have data about past decisions made by every agent for every object. In other words, the preference matrix is usually incomplete, and the goal of collaborative filtering is to predict these unknown values. This can be seen as a special case of a matrix completion problem (Aggarwal, 2016). Collaborative filtering algorithms enable predictions about the unknown preferences of an individual based on the known preferences of other agents to whom he or she is somehow linked. Suppose we do not have data about agent $a_i$’s preference for object $o_j$, but we do know the preferences for this object shown by other agents in the system. Collaborative filtering states that one can predict $p^{a_i}_{o_j}$ with the following aggregative function:

$$\hat{p}^{a_i}_{o_j} = \frac{1}{\alpha} \sum_{a_k \in A,\; k \neq i} \left( w_{ik}\, p^{a_k}_{o_j} \right) \qquad (1)$$

The quantity $\hat{p}^{a_i}_{o_j}$ corresponds to the predicted preference of agent $a_i$ for object $o_j$. It is the cumulative preferences of all other agents, each weighted by a coefficient $w_{ik}$ representing the relation between agents $a_i$ and $a_k$. The parameter $\alpha$ is a scaling parameter. In the basic form of collaborative filtering algorithms, $w_{ik}$ corresponds to a similarity relation between preference profiles of agents (the row vectors in $P_{oa}$). Thus, the more similar agent $a_k$ is to agent $a_i$ in terms of preference profile, the more this agent will weigh in the collaborative filtering process used to predict the unknown preference of agent $a_i$ for $o_j$. The rationale here is social coherence: when agents share similar patterns of preferences for some objects, they are more likely also to share similar preferences for a new object. Though many similarity coefficients could be used to compute similarity (cosine similarity metric, immediate neighbourhood-based process, ranking preferences, and others), it is common to use the Pearson correlation in the following way:

$$w_{ik} = \frac{\sum_{j=1}^{m} \left( p^{a_i}_{o_j} - \bar{p}_i \right) \left( p^{a_k}_{o_j} - \bar{p}_k \right)}{\sqrt{\sum_{j=1}^{m} \left( p^{a_i}_{o_j} - \bar{p}_i \right)^2}\; \sqrt{\sum_{j=1}^{m} \left( p^{a_k}_{o_j} - \bar{p}_k \right)^2}} \qquad (2)$$

where $\bar{p}_i$ and $\bar{p}_k$ are respectively the average preferences of agents $a_i$ and $a_k$ over their profiles. Accordingly, $w_{ik} = 0$ for independent preference profiles, $w_{ik} = 1$ for perfectly similar preference profiles and $w_{ik} = -1$ for opposite preference profiles.

3.2. Immediate neighbourhood-based collaborative filtering
An important variant of collaborative filtering is an immediate neighbourhood-based process. Instead of aggregating preferences from all other agents in the system, the collaborative filtering is limited to the neighbourhood of the agent for whom we want to predict an unknown preference. Let $V(a_i) = \{a_1, \ldots, a_N\}$ denote the neighbourhood of $a_i$ in the system. The neighbourhood is generally defined as the subset of the top $N$ agents closest to $a_i$. It is, for instance, the top $N$ most similar agents if the weighting scheme $w_{ik}$ represents a similarity relation. Thus, an immediate neighbourhood-based collaborative filtering process will correspond to the following function:

$$\hat{p}^{a_i}_{o_j} = \frac{1}{\alpha} \sum_{a_k \in V(a_i),\; k \neq i} \left( w_{ik}\, p^{a_k}_{o_j} \right) \qquad (3)$$

Aside from simplifying computation, the reason for limiting the filtering process to the immediate neighbourhood is that most of the weighting relations between agents in the system will be null and thus distort the prediction.
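The similarity-weighted prediction of Section 3.1 and its top-N restriction in Section 3.2 can be illustrated with a minimal Python sketch. Several choices here are assumptions made for illustration and are not taken from the paper: unknown preferences are encoded as `None`, the Pearson weight is computed only over objects known for both agents, and the scaling parameter is taken to be the sum of the absolute weights of the neighbours used. The toy matrix `P` and the function names are hypothetical.

```python
from math import sqrt

def pearson(u, v):
    """Pearson weight between two preference profiles.
    Assumption: only objects known (not None) for both agents are used."""
    pairs = [(x, y) for x, y in zip(u, v) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    mean_u = sum(x for x, _ in pairs) / len(pairs)
    mean_v = sum(y for _, y in pairs) / len(pairs)
    num = sum((x - mean_u) * (y - mean_v) for x, y in pairs)
    den = (sqrt(sum((x - mean_u) ** 2 for x, _ in pairs))
           * sqrt(sum((y - mean_v) ** 2 for _, y in pairs)))
    return num / den if den else 0.0

def predict(P, i, j, top_n=2):
    """Neighbourhood-based prediction of agent i's preference for object j.
    Assumption: the scaling parameter is the sum of absolute weights."""
    candidates = [(pearson(P[i], P[k]), P[k][j])
                  for k in range(len(P))
                  if k != i and P[k][j] is not None]
    # Keep the top_n most similar agents that have a known preference for j.
    neighbours = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_n]
    alpha = sum(abs(w) for w, _ in neighbours)
    if alpha == 0:
        return None
    return sum(w * p for w, p in neighbours) / alpha

# Hypothetical toy preference matrix: 4 agents x 4 objects.
P = [
    [0.5, 0.3, 0.2, None],  # agent 0: preference for object 3 is unknown
    [0.5, 0.3, 0.2, 0.4],   # same profile as agent 0 on the known objects
    [0.2, 0.3, 0.5, 0.1],   # roughly opposite profile
    [0.4, 0.3, 0.3, 0.3],   # similar profile
]
print(predict(P, 0, 3, top_n=2))
```

Setting `top_n` to the number of agents recovers the unrestricted aggregation over all other agents; a small `top_n` keeps dissimilar agents, such as agent 2 above, out of the prediction.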
3.3. Network-centric collaborative filtering
Most important for our goal here is the fact that collaborative filtering is not limited to recommender systems, and its predicting power can be relevant for the study of socio-semantic systems. Indeed, further developments of collaborative filtering models can be seen as attempts to capture signals from social relationships between agents in order to enhance their predictive accuracy. Collaborative filtering algorithms based on social networks are one such attempt to capture more social signals and leverage their predictive power (Aggarwal, 2016; Guy, 2015; Tang, Hu, & Liu, 2013; Yu, Zeng, Gillard, & Medo, 2016). In this kind of algorithm, socio-centric signals among agents are added as a new and independent input. Depending on the context in which collaborative filtering algorithms are used, these social relation signals can be trust, friendship, collaborators, followers, and many others. The assumption behind network-centric collaborative filtering is that an agent’s preferences and decisions are likely to be similar to or influenced by other socially connected agents. Therefore, social connections replace similarity as the weighting scheme in network-centric collaborative filtering. For example, if a user of a recommender system is friends with many other users who have liked (tagged) a particular movie, then she is more likely to like that movie too.

Formally, collaborative filtering based on social networks can be described as a bipartite multiplex network $G_{oa} = (A, O, P_{oa}, S_{aa})$, where, as previously, $A = \{a_1, \ldots, a_n\}$ denotes a set of $n$ agents, $O = \{o_1, \ldots, o_m\}$ denotes a set of $m$ objects and $P_{oa} = [p^{a_i}_{o_j}] \in \mathbb{R}^{n \times m}$ denotes the preference matrix. $S_{aa} = [s^{a_i}_{a_k}] \in \mathbb{R}^{n \times n}$ is the social matrix representing the weight of social connections between agents $A$. Like preferences, this parameter depends on available data and the context in which collaborative filtering is applied. $S_{aa}$ can be inferred from interaction frequency. For instance, $s^{a_i}_{a_k}$ could simply be the number of times $a_i$ and $a_k$ were co-present at the same events, or the number of emails they exchanged. $s^{a_i}_{a_k}$ could also be the result of an explicit rating by agents, just as objects are rated. In its simplest form, network-centric collaborative filtering states that one can approximate the preference $p^{a_i}_{o_j}$ of an agent $a_i$ for an object $o_j$ with the following aggregative function:

$$\hat{p}^{a_i}_{o_j} = \frac{1}{\alpha} \sum_{a_k \in A,\; k \neq i} \left( s^{a_i}_{a_k}\, p^{a_k}_{o_j} \right) \qquad (4)$$

The predicted preference $\hat{p}^{a_i}_{o_j}$ of agent $a_i$ for object $o_j$ is the result of the cumulative preferences of all other agents weighted by the social connectivity $s^{a_i}_{a_k}$ between them and agent $a_i$. The parameter $\alpha$ is again a scaling factor. The rationale here is that the stronger the social link between agent $a_i$ and agent $a_k$, the more likely are the preferences of $a_k$ to approximate the preferences of $a_i$.

3.4. Complex signals from network structures of social relations
The collaborative filtering algorithm in Eq. (4) above captures only signals about adjacency links (direct connections) between agents in a social network. In Eq. (4), only the direct neighbourhood of an agent impacts preference predictions. While the neighbourhood of an agent in a network is certainly an important signal about social relationships, there are more complex topological structures in a social network that a collaborative filtering algorithm can use as informative signals about social relations. We present here two types of social network structures used in collaborative filtering algorithms to capture different signals about social relationships in networks: structural equivalence (being linked with the same agents) and geodesic distance (number of links that have to be used to reach an agent).

3.4.1. Signal from structural equivalence
Structural equivalence is defined here in terms of common neighbours in a network between two agents $a_i$ and $a_k$. Let $V(a_i)$ denote the set of adjacent agents of $a_i$ and $V(a_k)$ the set of adjacent agents of $a_k$. Structural equivalence is based on the intersection of $V(a_i)$ and $V(a_k)$. The Jaccard coefficient can be used as follows to compute the structural equivalence $e(a_i, a_k)$ between two agents in a network:

$$e(a_i, a_k) = \frac{|V(a_i) \cap V(a_k)|}{|V(a_i) \cup V(a_k)|} \qquad (5)$$

Accordingly, another form of collaborative filtering algorithm that can be leveraged by signals from social network structural properties might take the following form:

$$\hat{p}^{a_i}_{o_j} = \frac{1}{\alpha} \sum_{a_k \in A,\; k \neq i} \left( e(a_i, a_k)\, p^{a_k}_{o_j} \right) \qquad (6)$$

3.4.2. Signal from geodesic distance
Geodesic distance is a topological measure in a network defined as the shortest path between two agents $a_i$ and $a_k$ (Wasserman & Faust, 1994). The length of a path corresponds to the number of links between $a_i$ and $a_k$. Hence, if $a_i$ and $a_k$ are adjacent (directly connected), their geodesic distance is 1. As with structural equivalence, geodesic distance can be used as a weighting scheme between agents to feed a collaborative filtering process. The use of geodesic distance as a weighting scheme in collaborative filtering is intended to capture signals about indirect relations between agents. The rationale is that if agent $a_i$ is connected to agent $a_j$ and $a_j$ is connected to $a_k$, but $a_k$ is not connected to $a_i$, the preferences of $a_k$ are still relevant for predicting the preferences of $a_i$, but proportionally to the length of the path between them in the social network.

3.5. Supervised machine learning for collaborative filtering
The most common form of collaborative filtering is the aggregative function of weighted preferences. According to this model, the unknown preference of an agent for an object is predicted by the weighted linear combination of preferences for the same object on the part of other agents. The signals from weighted preferences from other agents are simply averaged. However, recent developments in collaborative filtering have shown that more complex dependency patterns between these signals and an agent’s preferences can be captured to leverage predictive accuracy. These patterns are captured with supervised machine-learning algorithms. This framework is sometimes referred to as “model-based collaborative filtering” since instead of postulating a linear aggregative function, a data-driven model is induced by machine learning (Aggarwal, 2016; Su & Khoshgoftaar, 2009).

Supervised machine-learning algorithms are data-driven model inductors (James, Witten, Hastie, & Tibshirani, 2013; Mitchell, 1997). They can be used to approximate (or, in other words, to statistically learn a model of) very complex unknown functions between independent and dependent variables, provided that one has enough observations or “exemplars” about their empirical instantiations. In the context of collaborative filtering, the dependent variable is the preference $p^{a_i}_{o_j}$ of an agent $a_i$ for an object $o_j$, and the independent variables usually correspond to a vector of weighted preferences $\mathbf{w}_i = (w_{i1}\, p^{a_1}_{o_j}, \ldots, w_{ik}\, p^{a_k}_{o_j})$ for object $o_j$ on the part of the top $N$ relevant agents, such as agents sharing similar preferences with agent $a_i$ or who are connected to this agent by a social relationship. Given a training set of exemplars $\{(\mathbf{w}_i, p^{a_i}_{o_j}) \in W \times P_{oa}\}$ collected from known weighted preferences, a machine-learning algorithm seeks to induce a model of the function $g: W \to P_{oa}$ that can predict any unknown preference.

Depending on how preferences are coded (categorical or numerical variables), several classification algorithms or regression algorithms can be used, notably Bayes classifiers, support-vector machines, decision trees, artificial neural networks and the like. This framework usually outperforms the classical aggregative approach of collaborative filtering in terms of predictive accuracy, as it makes it possible to induce a much more complex and non-linear model of dependency patterns between an agent’s preferences and signals captured by the collaborative filtering process. The drawback is a loss of understanding: despite their high predictive power, data-driven models induced by machine learning can find relations so complex that they sometimes become black boxes that are hard to comprehend.
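The two network-based weighting schemes of Sections 3.4.1 and 3.4.2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the social network is a hypothetical adjacency dictionary, the scaling parameter is assumed to be the sum of the weights, and geodesic distance is turned into a weight as 1/d, a conversion the text does not specify.

```python
from collections import deque

def jaccard(adj, i, k):
    """Structural equivalence: shared neighbours over all neighbours of i and k."""
    union = adj[i] | adj[k]
    return len(adj[i] & adj[k]) / len(union) if union else 0.0

def geodesic(adj, i, k):
    """Shortest path length between i and k (breadth-first search).
    Returns None when k cannot be reached from i."""
    seen, queue = {i}, deque([(i, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == k:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def predict(prefs, weights):
    """Weighted aggregation of other agents' preferences for one object.
    Assumption: the scaling parameter is the sum of the weights."""
    alpha = sum(weights.values())
    if alpha == 0:
        return None
    return sum(weights[k] * prefs[k] for k in weights) / alpha

# Hypothetical co-authorship network and known preferences for one object.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
prefs = {"b": 0.2, "c": 0.4, "d": 0.6}

jaccard_weights = {k: jaccard(adj, "a", k) for k in prefs}
# Assumed conversion from distance to weight: closer agents count more.
distance_weights = {k: 1 / geodesic(adj, "a", k) for k in prefs}

print(predict(prefs, jaccard_weights))
print(predict(prefs, distance_weights))
```

In this toy network agent "d" is not adjacent to "a" yet still contributes to both predictions, which is the kind of indirect signal that direct social ties alone would miss.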
4. Methodological considerations for collaborative filtering predictions evaluation
The methodological setting necessary for evaluating collaborative filtering is analogous to the non-parametric approach commonly used for machine-learning tasks. This approach is based on the separation between the training data set used to train the model and a test data set used to evaluate its predictive accuracy. The specific characteristic of collaborative filtering evaluation is that the evaluating process takes the form of a matrix completion task that will define the preference matrix of the system (Aggarwal, 2016). Instead of having two separate matrices, one to train the model and one to test it, there is only one matrix from which one will randomly hide a subset of preference values. The subset of unhidden values is retained to feed the collaborative filtering model that will compute the various relational weighting schemes between agents, set neighbourhoods, and aggregate weighted preferences. The trained model is then used to predict the hidden values, and predictive accuracy is obtained by correlating hidden values in the preference matrix with their corresponding predicted values. The objective is to test the predictions from the collaborative filtering model to see if they outperform a baseline that can be seen as a null hypothesis.

4.1. Experimentation with a case study
As already noted, collaborative filtering is a computational model employed mostly in the context of recommender systems in which agents are users of digital platforms such as Netflix or Facebook, and objects are items such as films or “friendships”. Our main hypothesis here is that such a system can be used to predict agents’ semantic preferences, defined as word and concept use preferences, and that these predictions will be improved by the inclusion of network-centric signals. Before presenting the experiments we conducted and their results, we introduce the data set and the data preprocessing and modelling steps common to the experiments. The method used is based on natural language processing and text mining techniques. The first steps involve extracting concepts and agents’ semantic preferences from raw documents.

4.2. Data set
The data set for this study comprises all articles published between 2003 and 2005 in The New York Times (Sandhaus, 2008). From this corpus, we retained only the articles having a publication date, clear authorship and body text. This selection led to 132,947 full-text articles assigned to 3501 different authors. The journalists’ social network consists of co-authorship ties: two journalists are linked if they co-published newspaper articles in our data set. Co-signing, though less frequent than in scientific collaboration, is a practice in journalism. Just over 4% of the papers were co-signed by at least two people, for a total of about 2500. We used a bibliometric method commonly used in scientometrics and information science (Beaver & Rosen, 1978; Katz & Martin, 1997; Laudel, 2002). This method is based on the postulate that co-authorship is an important empirical indicator of collaboration and is often used to study collaboration between scientists (Wagner & Leydesdorff, 2003).

4.3. Data preprocessing
Data preprocessing refers to the preparation of the data set in a suitable way for text mining analysis. It involves word spelling normalisation with a lemmatization technique, word filtering based on part-of-speech (POS) tagging, and word frequency sorting. A topic-modelling algorithm was retained to extract concepts from the data set (see next section). Because topic-modelling algorithms are based on word co-occurrence redundancy in a corpus, it is important to keep only words reaching a certain frequency threshold: rare words occurring in fewer than 20 documents in the corpus were therefore filtered out. Moreover, not all kinds of words are proper candidates for expressing topics in data sets: determiners, prepositions, numbers, modals, auxiliary verbs and pronouns are irrelevant, and it is crucial to filter them out to reduce noise. Lemmatization and POS tagging were done with the MorphAdorner algorithm (Burns, 2013). The data preprocessing stage resulted in a lexicon of 26,215 distinct words and 22,598,408 occurrences.

4.3.1. Topic modelling
Topic-modelling techniques assume that words are used in texts in intentional and meaningful syntagmatic combinatorial patterns, known as co-occurrences, in order to express meanings. Moreover, these co-occurrence patterns are redundant in texts: similar word co-occurrence patterns are used to express similar meanings. As a result, studying these patterns can be informative about the semantic content that makes up a corpus. The correlation between specific word co-occurrence patterns and specific contents (called concepts, topics, and the like) is statistically strong enough that one can infer the latter from the former for study purposes. Topic-modelling algorithms have been developed to exploit this linguistic phenomenon: they identify words with similar redundant co-occurrence patterns among documents and cluster them into concepts. A concept is represented by a weighted list of many words and a word can appear in the list of more than one concept.

To retrieve the different concepts in our data set, we used a well-known topic-modelling algorithm based on the Latent Dirichlet Allocation model (LDA) (Blei, Ng, & Jordan, 2003). It is an unsupervised statistical machine-learning algorithm for topic discovery in texts. LDA is a probabilistic generative machine-learning method that models topics (concepts) as probability distributions over words in a corpus. It assumes that a corpus is organized according to a latent, fixed set of topics over its lexicon that determines how words co-occur to form topics. The goal of LDA is to statistically infer the best possible model to fit these topical co-occurrence patterns.

The data-modelling stage consisted of encoding the word distribution into a word × document matrix $W = [w_{ij}]_{M \times N}$, where $M = 26{,}215$ corresponds to the size of our lexicon, $N = 132{,}947$ is the number of documents, and $w_{ij}$ is the frequency of word $i$ in document $j$. Following Blei et al. (2003), we then applied an LDA algorithm to this matrix, together with a Gibbs sampling method as described in Griffiths & Steyvers (2004) [1]. The method is iterative: the topic modelling starts from initial random probability distributions and adjusts (i.e. statistically learns) through Gibbs sampling two kinds of conditional probabilities: (1) the probability $\Pr(w|z)$ that expresses the assignment of a word $w$ to topic $z$ in the corpus, and (2) the probability $\Pr(z|d)$ that corresponds to the proportion of words in a document $d$ assigned to topic $z$. When a convergence criterion is achieved, the method results in two reduced matrices, $\Phi = [\Pr(w|z)]_{M \times K}$ and $\Theta = [\Pr(z|d)]_{K \times N}$, where $K$ is the number of topics. Matrix $\Phi$ indicates which word co-occurrence best expresses a given topic in the corpus, while matrix $\Theta$ indicates which topics are the most significant in a given document.

After several trial-and-error runs, we extracted 200 topics from our corpus. A sample of 10 topics and their top 10 words is shown in Table 1 below. For instance, topic 3 corresponds to a list that includes the words gay, marriage, abortion, couple, sex, issue, conservative, marry, support, and lesbian. These words form a co-occurrence pattern expressing a clear topical content that can be described as “issues related to gay marriage.”

[1] The LDA was performed with an implementation in R language and the package topicmodels (Hornik & Grün, 2011).

Table 1
Sample of 10 topics extracted from the corpus and their top 10 words.

| topic 1 | Pr(w|z) | topic 2 | Pr(w|z) | topic 3 | Pr(w|z) | topic 4 | Pr(w|z) | topic 5 | Pr(w|z) |
|---|---|---|---|---|---|---|---|---|---|
| project | 0.066 | plan | 0.050 | gay | 0.069 | board | 0.173 | department | 0.110 |
| development | 0.051 | pension | 0.046 | marriage | 0.057 | member | 0.084 | agency | 0.088 |
| plan | 0.048 | social | 0.044 | abortion | 0.031 | director | 0.048 | federal | 0.059 |
| build | 0.031 | benefit | 0.043 | couple | 0.028 | committee | 0.045 | government | 0.049 |
| center | 0.027 | account | 0.037 | sex | 0.028 | chairman | 0.029 | program | 0.046 |
| site | 0.024 | security | 0.030 | issue | 0.022 | meeting | 0.029 | official | 0.037 |
| land | 0.020 | retirement | 0.029 | conservative | 0.017 | executive | 0.016 | state | 0.034 |
| construction | 0.017 | year | 0.025 | marry | 0.013 | president | 0.015 | service | 0.026 |
| developer | 0.016 | retire | 0.014 | support | 0.011 | meet | 0.011 | administration | 0.018 |
| property | 0.013 | worker | 0.014 | lesbian | 0.011 | decision | 0.009 | office | 0.014 |

| topic 6 | Pr(w|z) | topic 7 | Pr(w|z) | topic 8 | Pr(w|z) | topic 9 | Pr(w|z) | topic 10 | Pr(w|z) |
|---|---|---|---|---|---|---|---|---|---|
| wood | 0.040 | state | 0.193 | weight | 0.040 | review | 0.017 | life | 0.108 |
| play | 0.036 | governor | 0.102 | pound | 0.030 | sound | 0.010 | live | 0.042 |
| win | 0.028 | legislature | 0.025 | exercise | 0.025 | moment | 0.010 | time | 0.030 |
| hole | 0.025 | assembly | 0.020 | body | 0.023 | voice | 0.008 | thing | 0.026 |
| golf | 0.025 | political | 0.019 | lose | 0.017 | set | 0.007 | world | 0.026 |
| round | 0.024 | Public | 0.018 | physical | 0.016 | sense | 0.007 | make | 0.025 |
| year | 0.020 | Office | 0.016 | people | 0.013 | tone | 0.006 | learn | 0.023 |
| open | 0.020 | Republican | 0.015 | foot | 0.012 | style | 0.006 | feel | 0.019 |
| player | 0.016 | Leader | 0.015 | work | 0.012 | line | 0.006 | work | 0.019 |
| tour | 0.015 | legislator | 0.011 | weigh | 0.011 | light | 0.005 | dream | 0.018 |
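The document-frequency filter of Section 4.3 and the word × document matrix of Section 4.3.1 can be sketched in Python as below. This is a hedged illustration, not the authors' pipeline (which relied on MorphAdorner and an R implementation of LDA): the function name `build_matrix`, its `min_df` parameter and the toy documents are hypothetical, and the input is assumed to be already lemmatised and POS-filtered.

```python
from collections import Counter

def build_matrix(docs, min_df=20):
    """Build a word x document frequency matrix.
    docs: list of token lists, assumed already lemmatised and POS-filtered.
    Words occurring in fewer than min_df documents are dropped."""
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    lexicon = sorted(w for w, n in df.items() if n >= min_df)
    index = {w: row for row, w in enumerate(lexicon)}
    # M rows (words) x N columns (documents); W[i][j] = frequency of word i in doc j.
    W = [[0] * len(docs) for _ in lexicon]
    for j, doc in enumerate(docs):
        for word in doc:
            if word in index:
                W[index[word]][j] += 1
    return lexicon, W

# Hypothetical toy corpus; min_df is lowered to 2 so that some words survive.
docs = [["plan", "pension", "plan"], ["plan", "golf"], ["pension", "tour"]]
lexicon, W = build_matrix(docs, min_df=2)
print(lexicon)
print(W)
```

A matrix of this shape is what a topic-modelling routine would take as input. For a corpus the size of the one described here, a sparse representation would be needed instead of nested lists.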
filtering model solely based on the semantic profile similarities of the 3501 journalists. We test the hypothesis that journalist $a_i$'s semantic preference for concept $c_j$, denoted $p^{c_j}_{a_i}$, can be predicted by the semantic preferences for $c_j$ of journalists whose preferences for other concepts resemble those of $a_i$. The experiment was conducted with an aggregative model of the top 10 most similar neighbours (see Section 2.2). Collaborative filtering is calculated on the training set (two thirds of the observations) and predictions are tested on the test set (one third of the observations). Consistency between predictions and observed semantic preferences in the test set is measured with the Pearson correlation and the root-mean-square error (RMSE). The correlation indicates the direction of the consistency, while the RMSE allows us to evaluate the magnitude of the difference. We also created a baseline to evaluate the quality of predictions. The baseline is a predictive model solely based on the average semantic preferences of all journalists in the training set for each concept. This baseline becomes the minimal threshold under which predictions of semantic preferences are considered non-significant. The correlation between the baseline predictions and observed values is 0.13, and the RMSE is 0.0166. Fig. 1 shows the correlation between the collaborative filtering semantic preferences prediction and the observed values (r = 0.77), while Fig. 2 shows the distribution of residuals (RMSE = 0.0118). The correlation between observed values and collaborative filtering semantic preference prediction is largely above the baseline threshold, but analysis of residuals tells another story. It shows that collaborative filtering does indeed have a better overall predictive power than the baseline, but that prediction accuracy varies widely. Fig. 2, showing the distribution of residuals, explains why: although the direction of the consistency is strong, collaborative filtering tends to systematically underestimate predicted semantic preferences, notably for weak preferences (below 0.1).
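The procedure described above (a top-10 neighbourhood aggregation evaluated with the Pearson correlation and the RMSE) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: cosine similarity stands in for the similarity measure of Section 2.2, neighbours' values for the target concept are assumed to be known, and the function names `cosine`, `predict`, `pearson`, and `rmse` are hypothetical.

```python
import math

def cosine(u, v, skip):
    """Cosine similarity of two preference profiles, ignoring concept `skip`.

    Assumption: cosine similarity is a stand-in for the paper's measure.
    """
    dot = sum(a * b for k, (a, b) in enumerate(zip(u, v)) if k != skip)
    nu = math.sqrt(sum(a * a for k, a in enumerate(u) if k != skip))
    nv = math.sqrt(sum(b * b for k, b in enumerate(v) if k != skip))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(profiles, i, j, k=10):
    """Similarity-weighted mean of the top-k neighbours' preferences for concept j."""
    sims = [(cosine(profiles[i], p, j), p[j])
            for n, p in enumerate(profiles) if n != i]
    top = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    total = sum(s for s, _ in top)
    return sum(s * x for s, x in top) / total if total else 0.0

def pearson(xs, ys):
    """Pearson correlation (assumes both sequences have non-zero variance)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(pred, obs):
    """Root-mean-square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))
```

In such a sketch, the baseline would simply replace `predict` with the mean training-set preference for concept j, and both models would be scored with `pearson` and `rmse` on the hidden test values.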
4.3.2. Semantic preferences A semantic preference is an agent’s propensity toward certain semantic contents. In our case study, this propensity is the likelihood that a journalist will use a particular content (topic) in a newspaper article. Technically, semantic preferences correspond to the posterior probabilities of every one of the 200 topics for every one of the 3501 journalists of our data set. This yields a matrix P = [Pr(z|a)] of dimensions 3501 × 200, where Pr(z|a) reflects the propensity of journalist a to use concept z. Following the procedure described in Section 2.5 for evaluating collaborative filtering, one third of the semantic preference values were hidden, while the rest were used to feed the collaborative filtering computations. The result was a training set of 1,162,332 exemplars and a test set of 938,268 exemplars. 5. Results Since we conducted three experiments, our results are three-fold. First, we present the outcome of an experiment done with a canonical collaborative filtering model in which weighting schemes are similarities between journalists’ semantic preferences. Second, we show the results of an experiment that evaluates predictions made using collaborative filtering with network-centric signals. Finally, we present the outcome of an experiment combining both of the above types of collaborative filtering in a machine-learning framework.
5.2. Aggregative collaborative filtering based on signals from social networks The second experiment explores the predictive value of collaborative filtering using the network-centric signals of the 3501 journalists. Here, we are testing the hypothesis that one can predict the semantic preference $p^{c_j}_{a_i}$ of journalist $a_i$ for concept $c_j$ by looking at social
5.1. Aggregative collaborative filtering in SSN The first experiment explores the predictive value of a collaborative
collaborative filtering outperform the baseline, but the difference is weak. Collaborative filtering based on structural equivalence achieved the best score (0.20). RMSE results show that network-centric predictions are less accurate than those obtained by collaborative filtering based on semantic profiles. Hence, so far, similarity in semantic profiles appears more informative than social relationships. 5.3. Machine learning for collaborative filtering based on all signals combined The last experiment explores the predictive power of a model of collaborative filtering combining all of the signals used in the previous experiments: semantic preference profile similarities and the three network-centric signals. This experiment aims to evaluate the predictive power of a more complex collaborative filtering model and the marginal contribution of network-centric signals. The assumption behind the integration of new network-centric signals is that these signals are new, informative, and independent from signals captured by the canonical model of collaborative filtering based only on preference profile similarities. In the context of our case study, this rationale suggests that the predictive accuracy of a collaborative filtering model combining all previous signals should significantly outperform the models tested in the previous experiments. In order to combine all these different signals into one integrative model of collaborative filtering, we used the approach based on machine learning discussed in Section 3.5. We tested three different well-known supervised machine-learning algorithms: linear regression, decision tree, and random forest. The choice of these three algorithms is motivated by a concern to avoid overfitting and underfitting: linear regression is a strongly biased algorithm that tends to underfit, whereas decision trees are algorithms that tend to overfit, since their predictions are characterized by high variance. 
Random forest can be seen as a trade-off between bias and variance, usually outperforming the two other algorithms. Table 3 shows the predictive accuracy of a collaborative filtering model based on three machine-learning algorithms that integrate all the signals from semantic preference similarities and social networks. The best predictive accuracy is obtained with random forest and the worst with the decision tree. All three outperform the baseline, but only linear regression and random forest outperformed the canonical aggregative model tested in the first experiment. These results provide an important insight into the impact of signals from social networks in collaborative filtering. This impact is modest. At best, with random forest, combining all signals raises the correlation from 0.77 to 0.81 and lowers the RMSE from 0.0118 to 0.0098. This also means that signals from preference profile similarities are not independent of signals from social networks, contrary to what we had conjectured.
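To make the idea of combining several signals in a supervised learner more concrete, the following Python sketch fits the simplest of the three algorithms, a linear regression. It is a minimal illustration under assumptions that are not taken from the paper: each training exemplar is assumed to be described by one feature per signal (for example, the neighbourhood prediction obtained from that signal), the target is the observed preference, and `fit_linear` and `predict_linear` are hypothetical helper names.

```python
def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations.

    Assumption: features are not perfectly collinear (otherwise the
    system is singular and the elimination below divides by zero).
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    d = len(rows[0])
    # Augmented system [A^T A | A^T y].
    M = [[sum(r[a] * r[b] for r in rows) for b in range(d)]
         + [sum(r[a] * t for r, t in zip(rows, y))]
         for a in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(d):
        piv = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(d):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[a][d] / M[a][a] for a in range(d)]

def predict_linear(w, x):
    """Predicted preference: intercept plus weighted sum of the signals."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
```

With such a model, the fitted weights give a rough indication of how much each signal contributes once the others are accounted for; a decision tree or random forest would replace `fit_linear` while keeping the same feature layout.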
Fig. 1. Agents' semantic preferences observed and predicted in the test set by an aggregative collaborative filtering model based on similarity between profiles.
Fig. 2. Residuals of preferences predicted by a top 10 neighbourhood aggregative collaborative filtering model based solely on the signal from preference profile similarities.
relationships between journalists. We have retained the three types of signal presented in Section 3: geodesic distance (social proximity), structural equivalence (being linked with the same agents), and co-authorship frequency. Again, we have set the neighbourhood as the top 10 closest agents. Three types of socio-centric collaborative filtering were calculated for the training set and three types of prediction were carried out for the test set. Again, we calculated the Pearson correlation and the RMSE, and we used the same baseline. Table 2 shows the results of this experiment. Predictions obtained by the three types of network-centric
6. Discussion Given the importance of the ability to predict people’s opinions, we have used this case study to explore the use of collaborative filtering algorithms to predict semantic preferences of agents in a socio-semantic
Table 2 Prediction accuracy of aggregative collaborative filtering based on social network signals.
Signal from social network    Correlation    RMSE
Social proximity              0.18           0.017
Structural equivalence        0.20           0.017
Co-authorship frequency       0.15           0.019
Baseline                      0.13           0.017
Table 3 Prediction accuracy of a collaborative filtering model based on machine learning and the integration of all signals.
Machine-learning algorithms   Correlation    RMSE
Linear regression             0.79           0.0102
Decision tree                 0.73           0.0114
Random forest                 0.81           0.0098
system. This case study was carried out using a data set of newspaper articles published between 2003 and 2005 in The New York Times. Our main results show that collaborative filtering can very accurately predict semantic preferences in this socio-semantic system, but they also show that in this instance, taking social relationships into consideration does not add much to the quality of the prediction. At the very least, we can question the value added to the prediction, given the complexity of the analysis. Results suggest that social relationships may not be an important determinant of socio-semantic structures, and support the assumption of a discrepancy between social and socio-semantic information suggested by the works of Rowe, Stankovic and Alani (2012). Should we then conclude that our study does not support the practical value of studies conducted in the socio-semantic field? We do not think so. Rather, we believe our study raises new questions worth exploring. These questions are examined in the following sections.
6.3. Semantic relationships Traditionally, the term “semantic network” designated a network of concepts (Carley, 1986, 1997; Nerghes, Lee, Groenewegen, & Hellsten, 2015). Links between concepts are sometimes inferred from their usage by the same agent: if agent a uses concepts x and y, then they are seen as linked, and the more agents use the two concepts together, the stronger is the link. Other mechanisms such as inference, association, coherence or classification (Borge-Holthoefer & Arenas, 2010; Gärdenfors, 2004; Griffiths, Steyvers, & Tenenbaum, 2007; Sowa, 2006; Steyvers & Tenenbaum, 2010; Widdows, 2004; Zwarts, 2010) drive the evolution of the links between concepts. Adopting this point of view would suggest that semantic relationships, rather than social relationships between agents, are what drives the transformation of the agent-concept link. Exploring this hypothesis is yet another path to follow. Item-based collaborative filtering (Ha & Lee, 2017; Linden, Smith, & York, 2003; Sarwar, Karypis, Konstan, & Riedl, 2001), a model we did not discuss in this paper, could be a highly intuitive framework within which this hypothesis could be investigated.
6.1. Collaborative filtering as a way to capture social signals Collaborative filtering algorithms have been used in recommender systems for almost 20 years. Their predictive accuracy is usually high despite the fact that they are based on a surprisingly simple computational model. Some have postulated that this accuracy may be explained by the fact that collaborative filtering algorithms implemented in recommender systems such as Netflix and others capture a signal about a real social process at work in a system. That is, the signal captured by collaborative filtering algorithms would give us information about various social influence processes at work in a social system. Indeed, there is a strong formal resemblance between collaborative filtering models and empirically-based social influence models such as contagion (Bothner, 2003; Burt, 1987, 2010; Galaskiewicz & Burt, 1991; Galaskiewicz & Wasserman, 1989; Leenders, 2002; Mizruchi, 1989, 1993). For instance, a collaborative filtering model using preference profile similarities would capture the signal of social contagion based on mimicry – the fact that an agent’s preferences tend to be influenced by preferences and behaviours from other, similar agents (Bothner, 2003; Burt, 1987, 2010; Galaskiewicz & Burt, 1991; Galaskiewicz & Wasserman, 1989; Leenders, 2002; Mizruchi, 1989, 1993). If this postulate is valid, then adding social relationships is less important, since their main effect on the process under study would already be captured. We argue that transforming this postulate into a hypothesis and trying to verify it would be an important contribution to the study of socio-semantic systems.
6.4. Forecasting news dynamics Before concluding, we want to make a final observation about the link prediction problem in a socio-semantic system involving journalists. Our study might make a notable contribution to the field of news dynamics forecasting, another recent and important topic of interest (Bandari, Asur, & Huberman, 2012; Lerman & Ghosh, 2010; Lerman & Hogg, 2010; Wu & Shen, 2015). Indeed, the possibility of making forecasts about the conceptual content of news would be extremely valuable to the news content provider industry, policy-makers, advertisers, and journalists themselves. 7. Conclusion In the context of massive digitalisation of human activities and access to large-scale data sets, being able to predict or infer preferences of agents (customers, for example) is the object of a constant quest in which success provides a competitive advantage. From this perspective, the accuracy of such predictions is of utmost importance for information management. In this paper, we have sought to harness collaborative filtering algorithms, combining them with consideration of social signals to unravel part of the complexity of the socio-semantic system. We conducted a case study of a socio-semantic system composed of journalists writing for The New York Times between 2003 and 2005. We show that in this network, collaborative filtering based on semantic preference similarities between agents can predict unknown semantic preferences with great accuracy. We also see, in this case study, that adding signals from social relationships has little positive impact on the predictive accuracy of collaborative filtering. We suggest that the reason why these new signals do not leverage the predictive power of collaborative filtering as expected is that simple collaborative filtering based on similarity between agents’ preferences already captures important signals about social relationships. 
These results also pave the way for new questions about the limits of co-authorship as a proxy for journalists’ social relationships, and invite us to explore how collaborative filtering might incorporate the evolution of semantic links between concepts as a new signal that could increase its predictive power.
6.2. Social relationships The social relationship in our case study is co-authorship, a proxy that is much in use in socio-semantic and bibliometric studies (Wagner & Leydesdorff, 2003). Co-signing is, of course, not reserved to scientists. Although less frequent, it is also a practice in journalism. Accordingly, we have made use of this practice to identify collaborative ties between journalists in our socio-semantic system. This methodological decision has well-known limits. Notably, two of its postulates are questioned (Katz & Martin, 1997; Laudel, 2002). First, this method assumes that all co-authors have effectively worked together; this, at least in the scientific world, is not necessarily the case, as is evidenced by the “honorary authorship” practice (Kovacs, 2013). Second, the method also implies that all who have collaborated have also co-signed, which again is not always the case given the diversity of collaborative practices. Such limitations and others emerging from the world of journalism may explain the weak increase in prediction accuracy in our results. Better indicators of “real” social relationships might shed more light on the influence of social relationships on ties between agents and concepts.
Funding This work was supported by the Social Sciences and Humanities Research Council of Canada (SSHRC) [grant numbers 435-2017-1131, 2017].
Declaration of Competing Interest
honest contributors of every multi-author article. Journal of Medical Ethics, 39(8), 509–512. https://doi.org/10.1136/medethics-2012-100568. Laudel, G. (2002). What do we measure by co-authorships? Research Evaluation, 11(1), 3–15. Leenders, R. T. A. (2002). Modeling social influence through network autocorrelation: Constructing the weight matrix. Social Networks, 24(1), 21–47. Lerman, K., & Ghosh, R. (2010). Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. ICWSM, 10, 90–97. Lerman, K., & Hogg, T. (2010). Using a model of social dynamics to predict popularity of news. Proceedings of the 19th international conference on world wide web (pp. 621–630). Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019–1031. Lichtenwalter, R. N., Lussier, J. T., & Chawla, N. V. (2010). New perspectives and methods in link prediction. Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 243–252). Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1), 76–80. https://doi.org/10.1109/MIC.2003.1167344. Lorrain, F., & White, H. C. (1971). Structural equivalence of individuals in social networks. The Journal of Mathematical Sociology, 1(1), 49–80. https://doi.org/10.1080/0022250X.1971.9989788. Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and Its Applications, 390(6), 1150–1170. Martin, T., Hofman, J. M., Sharma, A., Anderson, A., & Watts, D. J. (2016). Exploring limits to prediction in complex social systems. Proceedings of the 25th International Conference on World Wide Web (pp. 683–694). Martínez, V., Berzal, F., & Cubero, J.-C. (2017). A survey of link prediction in complex networks. 
ACM Computing Surveys (CSUR), 49(4), 69. Mitchell, T. M. (1997). Machine learning. McGraw-Hill Science/Engineering/Math. Mizruchi, M. S. (1989). Similarity of political behavior among large American corporations. The American Journal of Sociology, 401–424. Mizruchi, M. S. (1993). Cohesion, equivalence, and similarity of behavior: A theoretical and empirical assessment. Social Networks, 15(3), 275–307. Nerghes, A., Lee, J.-S., Groenewegen, P., & Hellsten, I. (2015). Mapping discursive dynamics of the financial crisis: A structural perspective of concept roles in semantic networks. Computational Social Networks, 2. https://doi.org/10.1186/s40649-015-0021-8. Rattigan, M. J., & Jensen, D. (2005). The case for anomalous link discovery. ACM SIGKDD Explorations Newsletter, 7(2), 41–47. Ricci, F., Rokach, L., & Shapira, B. (2011). Introduction to recommender systems handbook. In F. Ricci, L. Rokach, B. Shapira, & P. B. Kantor (Eds.). Recommender systems handbook (pp. 1–35). Boston, MA: Springer US. https://doi.org/10.1007/978-0-387-85820-3_1. Roth, C. (2005). Co-evolution in epistemic networks: Reconstructing social complex systems. Sciences Humaines et Sociales, Thèse, 232. Roth, C., & Cointet, J.-P. (2010). Social and semantic coevolution in knowledge networks. Social Networks, 32, 16–29. https://doi.org/10.1016/j.socnet.2009.04.005. Rowe, M., Stankovic, M., & Alani, H. (2012). Who will follow whom? Exploiting semantics for link prediction in attention-information networks. International Semantic Web Conference (pp. 476–491). Sandhaus, E. (2008). The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12), e26752. Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th international conference on World Wide Web (pp. 285–295). Sowa, J. F. (2006). Semantic networks. Encyclopedia of cognitive science. Steyvers, M., & Tenenbaum, J. B. (2010). 
The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41–78. Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009, 1–19. https://doi.org/10.1155/2009/421425. Tang, J., Hu, X., & Liu, H. (2013). Social recommendation: A review. Social Network Analysis and Mining, 3(4), 1113–1133. Wagner, C. S., & Leydesdorff, L. (2003). Mapping global science using international co-authorships: A comparison of 1990 and 2000. Proceedings of ninth international conference on scientometrics and informetrics. Wasserman, S., & Faust, K. (1994). Social network analysis. Methods and applications. Cambridge University Press. Widdows, D. (2004). Geometry and meaning. Stanford: CSLI Publications. Wu, B., & Shen, H. (2015). Analyzing and predicting news popularity on Twitter. International Journal of Information Management, 35(6), 702–711. Yu, F., Zeng, A., Gillard, S., & Medo, M. (2016). Network-based recommendation algorithms: A review. Physica A: Statistical Mechanics and Its Applications, 452, 192–208. Zwarts, J. (2010). Semantic map geometry: Two approaches. Linguistic Discovery, 8(1), 377–395.
None. References Aggarwal, C. C. (2015). Data mining: The textbook (2nd ed.). Springer. Aggarwal, C. C. (2016). Recommender systems. Springer. Al Hasan, M., & Zaki, M. J. (2011). A survey of link prediction in social networks. Social network data analytics. Springer, 243–275. Bandari, R., Asur, S., & Huberman, B. A. (2012). The pulse of news in social media: Forecasting popularity. ICWSM, 12, 26–33. Beaver, D., & Rosen, R. (1978). Studies in scientific collaboration: Part I. The professional origins of scientific co-authorship. Scientometrics, 1(1), 65–84. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Borgatti, S. P., & Everett, M. G. (1992). Notions of position in social network analysis. Sociological Methodology, 22, 1. https://doi.org/10.2307/270991. Borge-Holthoefer, J., & Arenas, A. (2010). Semantic networks: Structure and dynamics. Entropy, 12(5), 1264–1302. Retrieved from http://www.mdpi.com/1099-4300/12/5/1264. Bothner, M. S. (2003). Competition and social influence: The diffusion of the sixth-generation processor in the global computer industry. The American Journal of Sociology, 108(6), 1175–1210. Breiger, R. L., Boorman, S. A., & Arabie, P. (1975). An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12, 328–383. Burns, P. R. (2013). MorphAdorner v2: A Java library for the morphological adornment of English language texts. Retrieved from Northwestern University: https://morphadorner.northwestern.edu/morphadorner/download/morphadorner.pdf. Burt, R. S. (1987). Social contagion and innovation: Cohesion versus structural equivalence. The American Journal of Sociology, 1287–1335. Burt, R. S. (2010). The shadow of other people: Socialization and social comparison in marketing. The connected customer: The changing nature of consumer and business markets, 217–256. 
Carley, K. M. (1986). An approach for relating social structure to cognitive structure. The Journal of Mathematical Sociology, 12, 137–189. https://doi.org/10.1080/0022250X.1986.9990010. Carley, K. M. (1997). An approach for relating social structure to cognitive structure. Journal of Organizational Behavior, 18, 533–558. Chartier, J.-F. (2016). Reconstruction d’un système sociosémantique par apprentissage machine. Montréal: Université du Québec à Montréal. Galaskiewicz, J., & Burt, R. S. (1991). Interorganization contagion in corporate philanthropy. Administrative Science Quarterly, 88–105. Galaskiewicz, J., & Wasserman, S. (1989). Mimetic processes within an interorganizational field: An empirical test. Administrative Science Quarterly, 454–479. Gärdenfors, P. (2004). Cooperation and the evolution of symbolic communication. In D. K. Oller, & U. Griebel (Eds.). Evolution of communication systems: A comparative approach (pp. 237–256). Cambridge: The MIT Press. Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228–5235. https://doi.org/10.1073/pnas.0307752101. Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2), 211. Guy, I. (2015). Social recommender systems. Recommender systems handbook. Springer, 511–543. Ha, T., & Lee, S. (2017). Item-network-based collaborative filtering: A personalized recommendation method based on a user’s item network. Information Processing & Management, 53(5), 1171–1184. Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann. Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval SIGIR ’99 (pp. 230–237). https://doi.org/10.1145/312624.312682. 
Hofman, J. M., Sharma, A., & Watts, D. J. (2017). Prediction and explanation in social systems. Science, 355(6324), 486–488. Hornik, K., & Grün, B. (2011). Topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1–30. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer. Katz, J. S., & Martin, B. R. (1997). What is research collaboration? Research Policy, 26(1), 1–18. Kovacs, J. (2013). Honorary authorship epidemic in scholarly publications? How the current use of citation-based evaluative metrics make (pseudo)honorary authors from