Expert Systems With Applications 141 (2020) 112955
Expert Systems With Applications journal homepage: www.elsevier.com/locate/eswa
SoTaRePo: Society-Tag Relationship Protocol based architecture for UIP construction Shubham Goel, Ravinder Kumar∗ Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
Article info
Article history: Received 16 February 2019; Revised 29 June 2019; Accepted 14 September 2019; Available online 19 September 2019
Keywords: Social network; Social tagging; Protocol; Ranking; Recommendations
Abstract
With the advancement of web services, there has been a rapid proliferation in the size of the web and the number of web users, where each user holds a different viewpoint towards the same information. This, in turn, has made it a big challenge for web search platforms to interpret the preferences of users and provide the desired information to them. The most suitable solution to this problem is personalization of web search. A personalization system is a kind of expert and intelligent system which can automatically learn about the preferences of a user so that it can provide search results according to their relevance to that user. The process of acquiring knowledge about a user’s preferences by a personalization system is known as User Interest Profile (UIP) construction. In the field of search personalization, it cannot be denied that only an efficient and complete UIP can lead to an effective and high-performing web search personalization methodology. However, most studies on web search personalization have focused only on UIP modeling without any thought about the quality of the UIP, and rather limited attention has been paid to the sparsity issue of UIP modeling. In this paper, we propose a novel protocol based architecture model to create an efficient UIP by exploiting the direct and indirect interest of a user. Direct interest aims at mining a user’s preferences from his own activities on a social information platform. Because a UIP constructed solely on the basis of direct interest is sparse and ineffective, the explicitly defined society and real-world activity relationships of a user on a social platform are used to predict his indirect interest. In order to unearth the user’s activity relationships, the concept of semantic relatedness, computed using the Word2vec model, has been used.
Moreover, different trust levels in society relationships have also been incorporated into the proposed model to facilitate the prediction of user’s indirect interest. A series of experiments have been conducted on a del.icio.us dataset to evaluate the effectiveness of the proposed model. The results show that the model has outperformed each and every baseline in relation to complete and efficient UIP construction. © 2019 Published by Elsevier Ltd.
∗ Corresponding author. E-mail addresses: [email protected] (S. Goel), [email protected] (R. Kumar). https://doi.org/10.1016/j.eswa.2019.112955 0957-4174/© 2019 Published by Elsevier Ltd.
1. Preamble
In recent years, the information retrieval efficiency of web search engines has considerably decreased, as they are unable to cope with the voluminous size of the web and the diversity of user preferences. Generally, search engines do not consider the query issuer’s preferences, and almost the same result set is retrieved for everyone. However, in the current scenario of information requirement, the “one size fits all” approach, i.e., a similar package of results for everyone, is completely undesirable, as users hold a different viewpoint on each topic. For example, in the case of the query “java”, for some users it may represent an island, while for others it may mean a programming language. Thus, the retrieval of results must depend on the preference of the user who is making the query, i.e., whether he is a nature-loving person or a computer programmer. The application of traditional search approaches in such situations would lead only to re-framing of queries, frustration, wasted time or even unfruitful search sessions. So, according to the demand of the information retrieval market, the process needs to transition from a generalized approach to a personalized one, i.e., it needs to be more user-centric. In order to enlist a user’s preferences, his User Interest Profile (UIP) must be constructed. Research communities have thoroughly analyzed the effectiveness of UIPs created on the basis of browsing history (Liu, Yu, & Meng, 2004; Makvana, Shah, & Shah, 2014; Sugiyama, Hatano, & Yoshikawa, 2004; Tan, Shen, & Zhai, 2006), location (Tapia-Fernández, Rodríguez, Velázquez, Seco, & Jiménez, 2015; Zhou, Xie, Wang, Gong, & Ma, 2005), desktop
files (Chirita, Firan, & Nejdl, 2006; Teevan, Dumais, & Horvitz, 2005), clickthrough information (Chawla, 2016; Shen, Tan, & Zhai, 2005), social information (Bouadjenek, Hacid, & Bouzeghoub, 2013; Kim & Park, 2013; Morris, Teevan, & Panovich, 2010), etc. Of these, the profiles obtained using social information have proved to be more effective than those obtained by other methods. After obtaining the list of user interests in the form of a UIP, personalization of the user’s web search is performed with respect to the UIP, either by reformulating the user’s query or by re-arranging the results. There are many social information platforms, but collaborative tagging sites (like del.icio.us, MovieLens, Flickr), microblogging sites (like FriendFeed, Tumblr, Twitter) and social networks (like LinkedIn, Snapchat) are the prominent ones. Knowingly or unknowingly, people continuously generate a large amount of information about themselves through these platforms in order to be present on the web. On the whole, this information is a very rich source for predicting user preferences. A user profile constructed solely on the basis of the user’s own activities will be sparse and inefficient, as the performance of a UIP is greatly influenced by the amount of user information taken as input by the UIP construction methodology. It all depends on the frequency of the user’s social network activities, because the amount of user information available on the web is directly proportional to the measure of the user’s activities. The performance of the UIP directly affects the efficiency of a web search personalization approach, as the UIP of a user is the backbone of every personalization algorithm designed either for web search or for recommender systems. Thus, to prevent injustice to an irregular user, some additional information must be linked with the user’s account for the construction of a UIP. The linking of additional information to the UIP of a user is known as UIP augmentation.
Different strategies have been followed by various researchers for the augmentation of a user profile, such as tag clustering (Kumar, Lee, & Kim, 2014), community information (Shafiq, Alhajj, & Rokne, 2015), resource correlations (Xie et al., 2016) and tagging actions (Bouadjenek, Hacid, Bouzeghoub, & Vakali, 2016). The information accumulated for UIP augmentation helps to create an efficient and complete UIP. However, these days, employing a single strategy for UIP augmentation is not enough; a strong UIP must be a joint venture of user information acquired through multiple strategies. All the research studies conducted till now have focused on employing a single strategy, or no strategy at all, for UIP augmentation. This paper proposes an innovative protocol based architecture for a UIP construction model, where each protocol is equipped with a different prediction and learning approach. The concept of semantic relatedness between different tag pairs has been used to unearth various latent real-world relationships between the tags under consideration. The real-world society relationship network explicitly defined by a user has also been explored for UIP augmentation in a novel way. The society relationships of a user define the type and level of relationship the user has with other persons who are, directly or indirectly, known to him. Finally, the information acquired with the help of the different protocols in the proposed model is combined to give an efficient and complete UIP. In the case of web search engines, a personalization methodology acts as an intelligent interface between a user and the web search engine. After receiving the search results, the personalization methodology re-arranges them based on the UIP of the query issuer to provide a list of results in order of their relevance.
The other action that can be performed by an intelligent interface is to reformulate the user’s query according to the knowledge of user’s UIP before passing to the search engine. Moreover, a personalization methodology can also be embedded into a recommender system to provide personalized recommendations to the
users. Any claim of designing a personalization methodology without knowledge of the user’s interest, i.e., in the absence of a UIP construction or learning model, is baseless, as presenting information to a user according to his/her preferences is the only goal of personalization. Moreover, it cannot be denied that only an efficient and complete UIP can lead to an effective and high-performing web search personalization methodology. Therefore, the only motive behind the work presented in this paper is to contribute towards the design of an intelligent and expert system that can provide a personalized search experience to a web user. To conduct the experiments, social information corresponding to various users has been obtained from a well-known collaborative tagging site, i.e., del.icio.us. It allows users to annotate various web pages using any tags of their interest. The information provided by a collaborative tagging site is first-hand information generated by the user himself; therefore, its credibility is much higher than that of information collected from any other source. The specific contributions of the work presented in this paper are as follows:
• Analysis of the user’s collaborative tagging information, and exploration of the real-world society and tag-tag group relationships of the user.
• Design and implementation of the protocol based architecture for UIP construction based on the explored relationships.
• Comparison of the UIPs constructed by the proposed model and various baseline methodologies.
• Quantification of the comparison based on the P@K, MRR, RIL and comp evaluation parameters, and verification of the results using hypothesis testing.
The rest of the paper is structured as follows: Section 2 discusses the various UIP construction methodologies, from basic ones to various UIP augmentation strategies. Section 3 presents the proposed model of UIP construction with detailed examples of each strategy used to predict the direct and indirect interest of a user. Section 4 describes the dataset used for training and testing the proposed model, the evaluation metrics, and some prominent UIP construction baselines. Section 5 presents the experiments conducted to perform the comparative analysis of the proposed model against different baselines. It also describes the testing of various null hypotheses. Section 6 provides a summary of this work and also carries certain directions for further research.
2. Literature survey
User profiling is a requisite technique to perceive the interest and behavior of a user. More specifically, collaborative tagging based user profiling is generally used to represent user preferences on the social web, which in turn assists web search personalization algorithms. Interactive Internet applications like del.icio.us for tagging of web pages, Flickr for tagging of images, MovieLens for tagging of movies, etc. are some prominent collaborative tagging platforms that enable a web user to annotate a web page with a tag of interest. Formally, collaborative tagging is represented by a Collaborative 3-partite Graph (C3TG) as follows:
Definition 1. A Collaborative 3-partite Graph (C3TG) is a special type of graph denoted by $G_3(V, E)$ such that $V \in (U \cup T \cup W)$ and $E \in R^{u}_{t,w}$. On the whole, a C3TG can be stored in memory as a quadruple $(u, t, w, R^{u}_{t,w})$ with $R^{u}_{t,w} \subset (U \times T \times W)$,
where $U$, $T$, and $W$ depict the User, Tag, and Web Resource sets respectively. The ternary relation denoted by $R^{u}_{t,w}$ depicts that a web resource named $w$ is annotated by a user $u$ using the tag of interest $t$. Some researchers inferred UIP for personalization of
web search (Bao et al., 2007; Bouadjenek et al., 2016; Cai, Li, Xie, & Min, 2014; Du et al., 2016; Kumar & Kim, 2011; Kumar et al., 2014; Shafiq et al., 2015), while some others used it for recommender systems (Hannon, Bennett, & Smyth, 2010; Liu & Lee, 2010; Luo, Ouyang, & Xiong, 2012; Yang, Guo, Liu, & Steck, 2014). SocialPageRank (SPR) and SocialSimRank (SSR) are algorithms for web search designed by Bao et al. (2007), where the tags used by user himself for annotating various web pages formed the basis of UIP construction. The prediction of most probable value for the level of interest a user holds for a particular tag always remains a cause of concern for the researchers as the approaches like TF (Noll & Meinel, 2007), TF_IUF (Xu, Bao, Fei, Su, & Yu, 2008), BM25 (Vallet, Cantador, & Jose, 2010), etc. have their own pros and cons. Cai et al. (2014) studied all these issues and proposed Normalized Term Frequency (NTF) for assigning a proper weight to a tag as per the user preference degree. The results obtained by them are more satisfying as compared to the previous approaches. In the study conducted by Bouadjenek et al. (2016), an adaptation of a wellknown TF-IDF method of information retrieval was made to create UTF-IUF, which in turn has been used for UIP construction. The effect of integrating this information about a user into the indexing schema of information retrieval systems has also been deeply studied. Bouadjenek et al. (2013) have given the concept of LAICOS which is a web search engine for information retrieval. LAICOS performs the function of searching by considering tags or annotation available with the web page as metadata for construction of UIP and traditional method of textual content matching between query and document. The UIP constructed using only the user’s own tags provides an incomplete view of the user profile which fails to enlist each and every preferable item of a user. 
Thus, it is necessary to incorporate some additional information for UIP augmentation. In the study conducted by Shepitsen, Gemmell, Mobasher, and Burke (2008), the interest of a user and web resources which comply with the interest are determined using the concept of tag clustering on all the tags present in the collaborative tagging. The assumption forming the basis for adopting tag clustering is that cluster comprises of various tags associated with each other in some context either by syntactic or semantic similarity. Two methods of UIP construction, viz. svdCUIP and modSvdCUIP have been devised in their work by Kumar et al. (2014) based on the same assumption of tag clustering, but the method of tag relationship determination and cluster formation is different. As depicted by the naming convention of the methods, the technique of Singular Value Decomposition is utilized to identify the similarity level of tags in the dataset. As per the results presented by them, modSvdCUIP is far better performer than its counterpart in terms of similarity level identification. The sentiment dictionary was integrated into UIP construction methodology for its augmentation by Xie et al. (2016),. The benefit of incorporating the community information of a user was also analyzed by some researchers based on a universally accepted fact that the surrounding in which a person lives or works is the major deciding factor of his attitude, preferences and behavior (Kim & Park, 2013; Xie et al., 2014). In their work, Xie et al. (2014) proposed a novel approach of Multi-faceted folksonomy graph (MFG) to integrate multi-faceted relation in social media. Latent user communities are mined from social media applications by using MFG which in turn help to identify the additional information for UIP augmentation. 
For the modeling of user interest, Huang, Yeh, Lin, and Wu (2014) fused together the frequency, duration and recency of tags with the neighbors’ information from the social friends network. The resultant profile is then used for making collaborative recommendations. Shafiq et al. (2015) have also analyzed the user’s friendship network for UIP augmentation, where a pair of users is considered as friends only when both have been involved in updating a common Wikipedia page at some point in its history. In order to compute the level of trust a user has in a particular friend, two special types of matrices, i.e., relevancy and credibility, are used. Only those friends that are selected by applying different analysis techniques to the network formed by the trustworthy friends of the user are eligible to provide recommendations. In their research work, Al-Shamri (2016) discussed various methodologies used for the construction of a user profile in recommender systems, especially demographic-based recommenders. A detailed description of suitable similarity measuring approaches is provided along with their advantages and disadvantages. The current methodologies of UIP construction are blind to the real-world social relationships that surround users and web pages. Across the literature on UIP construction methodologies, the information selected for UIP augmentation was identified using the similarity relationships existing either between a pair of tags or between a pair of users. Almost everyone used the same kind of approach to compute these similarities. Some researchers used the intersection of tag sets, while others used common resource sharing or editing actions to compute the user-user similarity, which helped to identify the user community. In a physical world scenario, defining a credibility matrix on the basis of these latent communities is not possible, and any claim of constructing an accurate credibility matrix is groundless, as friendship is a completely personal decision; no one can be instructed to become someone’s friend. The similarity strength between a pair of tags depends either on the sharing of tags by two different web resources or on the tags used for annotating the same web resource. But in the real world, every relation determined on the basis of these approaches is insignificant, as it is just a matter of chance.
Furthermore, the dataset used as the foundation to build such tag relations is very small. In contrast, the model proposed in this paper uses an innovative protocol based approach to construct an efficient and complete user interest profile. The real-world society and tag relationships have been explored for augmenting the UIP of a user. To summarize the relative strengths and weaknesses of the various UIP construction methodologies studied in the literature, a high-level comparison is shown in Table 1, which also compares the proposed model with these methodologies. Generally, classic sorting algorithms are used by researchers to sort the UIP with respect to the degree of interest. But the UIP obtained from a UIP construction methodology is already in partially sorted order, and the use of standard sorting algorithms in this case only adds to the time complexity of the system. So, keeping this challenge in consideration, a specially designed algorithm named CBIS by Goel and Kumar (2018) has been incorporated in the proposed model for sorting the partially sorted data.
3. Proposed model: Society-Tag Relationship Protocol based architecture
In the current scenario, employing a single strategy to predict the interest of a user is not enough. There must be a number of strategies, each using a different set of rules to embed various pattern recognition and learning algorithms for user interest prediction. This section describes the proposed protocol based architecture model for the prediction of user interest (direct and indirect) and the construction of a User Interest Profile (UIP) from it. The first protocol is used to obtain the user’s own tags with appropriate weights and also serves as the foundation for the remaining protocols; therefore, it is named the Base protocol ($B_p$).
The remaining two protocols, i.e., Guild protocol (Gp ) and Congregation protocol (Cp ) have been used to recommend tags from society
Table 1 Comparison between various UIP construction methodologies studied in literature with the proposed model.
UIP construction methodology Bao et al. (2007); Noll and Meinel (2007) Xu et al. (2008)
UIP augmentation type
UIP augmentation approach
User Interest weighting technique
UIP parametric basis
Relationship type
✗
✗
TF
User’s tags
Ternary
✗
✗
TF-IUF
User’s tags
Ternary
Tag-Tag
Tag clusters using HAC
TF
Vallet et al. (2010) Liu and Lee (2010)
✗ User-User
✗ Nearest neighbor network
BM25 TF
Hannon et al. (2010)
User-User
Follower-Followee relationship frequency
TF-IDF
Luo et al. (2012)
User-User
TF
Kim and Park (2013)
User-User
Neighborhood based model using tagging relations on same item Topic similarity in topic based profile
Cai et al. (2014) Kumar et al. (2014)
✗ Tag-Tag
Xie et al. (2014)
User-User
Huang et al. (2014)
Shafiq et al. (2015) Du et al. (2016)
-
✗ Tag clusters using HAC & SVD based tags similarity Latent user community & multi-faceted folksonomy graph
NTF TF-IDF
User-User
cosine similarity of user profile
TF-IUF
User-User
Network analysis of credible users, fractional cascading ✗
TF
✗
-
TF-IDF
Bouadjenek et al. (2016)
✗
✗
UTF-IUF
Al-Shamri (2016)
User-User
Correlation of demography based UIP
-
Proposed
User-User, Tag-Tag
HAC based tag clusters, semantic relatedness to have more real-world approximation of tag-tag relations & trust measurement in user’s society network
NTF
User’s tags & clusters recommendations User’s tags User’s preferences & neighbors recommendations without trust measurement User’s preferences & similar users recommendations without trust measurement User’s tags & similar users recommendations without trust measurement user’s preferences & credible users recommendations User’s tags user’s tags & cluster recommendations User’s tags & similar users recommendations without trust measurement User’s tags & similar users recommendations without trust measurement User tags & friends recommendation user’s tags user’s tags & interests in other tag using matrix factorization User’s topics & similar users recommendations without trust measurement User’s tags & recommendations from clusters and trusted society relatives
Ternary Ternary Symmetric
Asymmetric
Ternary
symmetric
Ternary Ternary Asymmetric, ternary Ternary
Symmetric Ternary
✗ √ ✗ ✗ √
✗ √
√ ✗ ✗ √
√ √
Dynamic
Experiment domain
✗
✗
Del.icio.us
✗
✗
✗
✗
✗ ✗
✗ √
Del.icio.us, Dogear Del.icio.us, Last.fm Del.icio.us users of cyworld
✗ √
✗ √ √ ✗
✗
✗
√
✗ √ √
√
√
✗
✗ √
✗
Symmetric
✗
✗
√
√
√
Note 1: The symbols (✗) and (√) represent the absence and presence of the feature respectively, while (-) means that no information is available regarding the feature in a methodology.
Note 2: The sparsity handling capabilities of the various UIP construction methodologies differ, but the table shows only the presence or absence of a sparsity handler.
Note 3: The ternary relation type refers to the relationship of user, web resource and annotation, while symmetric and asymmetric refer to user-user or tag-tag relationships.
√
√
Ternary
Symmetric, Ternary
√
√
√
twitter
MovieLens
Facebook, Google query log Del.icio.us Del.icio.us, AOL query log NUS, Flicker
Del.icio.us
Wikipedia page update history MovieLens, Epinion Del.icio.us MovieLens demography data Del.icio.us, Wikipedia text
Shepitsen et al. (2008)
Scalable √
UIP sparsity handler
relationships of the user and group relations of the user’s tags respectively. The direct interest of a user is identified by the $B_p$ protocol, whereas both $G_p$ and $C_p$ are responsible for the prediction of indirect interest. Formally, a UIP can be represented by a vector as follows:
Definition 2. A User Interest Profile (UIP) is a trade-off between the strategy vectors generated by the $B_p$, $G_p$, and $C_p$ protocols. For a target user $i$, the UIP is represented by $\vec{U}_i$ as:
$$\vec{U}_i = F_{trade}\left(\vec{U}_i^{B_p}, \vec{U}_i^{G_p}, \vec{U}_i^{C_p}\right)$$
where $\vec{U}_i^{B_p}$, $\vec{U}_i^{G_p}$ and $\vec{U}_i^{C_p}$ represent the strategy vectors, i.e., tags with their suitable weights obtained from $B_p$, $G_p$ and $C_p$ respectively. A strategy vector provides an insight into the UIP corresponding to the strategy under consideration; therefore, it is also known as a UIP vector. The trade-off of strategy vectors is depicted by the function $F_{trade}$. From now on, strategy vector and UIP vector will be used interchangeably. For the current research work, only three strategies have been selected for direct-indirect interest prediction to give the protocol based architecture of UIP. But the proposed model is not limited to these three protocols; it can be extended up to an n-protocol architecture depending on the number of strategies selected for direct-indirect interest prediction. Each protocol of the proposed model enlists the rules for implementing the corresponding strategy. The description of the entire model is divided into two sub-modules: the first explains the process of direct interest identification, whereas the second describes the process of indirect interest prediction. The generic framework of the proposed protocol based architecture model of UIP is represented with the help of a flow diagram in Fig. 1. The circled numbers depict the sequence in which the processes must be performed to construct the UIP in step 5. The Guild and Congregation protocols share the same step number, i.e., 4, which denotes that both can be performed in parallel. A detailed description of each protocol is provided in the subsequent subsections.
3.1. Direct interest identification
The direct interest of a user represents the user preferences which are explicitly defined by the user and identified from his/her social network activities.
With the common goal of ultimately refining and personalizing the user’s web search, researchers from both academia and industry are extensively analyzing user generated tagging-based profile construction techniques (Cai et al., 2014; Kumar & Kim, 2011). The foremost step in $B_p$ is to identify the ternary relations, i.e., $R^{u}_{t,w}$, of a user from the collaborative tagging information, as shown in Fig. 1. The ternary relations represent “which user has annotated which web resource with which tag”. A simplified view of the network formed by the ternary relations for the whole dataset is presented in Fig. 2, as the complete view of the network is very complex. The tags collectively used by a user provide a valuable and precise description of the user’s direct interest. But as per studies of human nature and mind, a person cannot like each and every item with an equal amount of interest; there always exists some difference in the degree of interest. Therefore, an appropriate weight must be assigned to each tag in accordance with the degree of interest the user holds for that tag. Here, the collaborative tagging information has been obtained from del.icio.us, and tags represent the activities performed by a user. The tags used by a user himself for annotating various web pages and the degrees of interest assigned to them constitute the strategy vector corresponding to $B_p$, which can be formally represented as follows:
Definition 3. Let $\{t_1^{u_i}, t_2^{u_i}, \ldots, t_n^{u_i}\}$ and $\{\theta_1^{u_i}, \theta_2^{u_i}, \ldots, \theta_n^{u_i}\}$ be the sets of tags used by user $i$ himself and the corresponding degrees of interest he holds for those tags respectively. For target user $i$, the UIP vector obtained by the Base Protocol ($B_p$) is represented by $\vec{U}_i^{B_p}$ as:
$$\vec{U}_i^{B_p} = \left(t_1^{u_i} : \theta_1^{u_i},\; t_2^{u_i} : \theta_2^{u_i},\; \ldots,\; t_n^{u_i} : \theta_n^{u_i}\right)$$
where $n$ depicts the cumulative count of tags used by user $i$ for annotating various web pages, and $\theta_j^{u_i}$ is the degree of interest in tag $t_j^{u_i}$. The technique of Normalized Tag Frequency has been used for the calculation of $\theta_j^{u_i}$, based on the assumption made in the work by Cai et al. (2014). According to this assumption, if a user utilizes a tag more frequently than other tags to annotate various web resources, then the user is more interested in that tag. However, merely calculating the tag frequency as the degree of interest would result in a bias towards active users who annotate web resources very frequently. The mathematical formulation for the calculation of $\theta_j^{u_i}$ is as follows:
$$\theta_j^{u_i} = \frac{c_j^{u_i}}{c^{u_i}} \qquad (1)$$
where $c_j^{u_i}$ depicts the count of dataset records in which user $u_i$ has used tag $t_j$ to annotate web pages, and $c^{u_i}$ is the count of web pages annotated by $u_i$. The larger the value of the degree of interest $\theta_j^{u_i}$, the higher the probability of $t_j$ being a preferred tag of user $u_i$. An example of the $B_p$ based profile construction process is explained in Fig. 2, where a part of the ternary relations of the users Abu, Ajay, Geeta, and Ravi is considered. The User-Tag and User-Item matrices are constructed from these ternary relations. The matrices shown, however, are the actual matrices constructed for the entire ternary relation schema of the users under consideration on the actual dataset. Each cell $(i, j)$ of the User-Tag matrix represents $c_j^{u_i}$ in Eq. (1), whereas in the User-Item matrix each cell represents the number of times $u_i$ has annotated the web resource $w_j$. The sum of all the values in the row of the User-Item matrix corresponding to a user represents $c^{u_i}$ in Eq. (1). Each cell $(i, j)$ of the Weighted User-Tag matrix in Fig. 2 represents the degree of interest $\theta_j^{u_i}$ that user $u_i$ has in tag $t_j$, calculated using Eq. (1). In the case of user Abu, the degree of interest he has in the tag Mathematics is 0.15, where the value of $c_j^{u_i}$ is 19 and $c^{u_i}$ is 122 (the sum of the values in the row corresponding to Abu in the User-Item matrix). For visual representation, different color codes and shapes have been used, i.e., a brown circle for the degree of interest, a red circle and arc for $c_j^{u_i}$, and a green oval and arc for $c^{u_i}$. Similarly, the remaining degree of interest values for Abu, Ajay, Geeta and Ravi can be calculated, where the $c^{u_i}$ values for Ajay, Geeta, and Ravi are 312, 497, and 79 respectively.
3.2. Indirect interest identification
An appropriate description of a user’s interest is only provided by his own tags, but the construction of a UIP using solely these tags is inefficient and incomplete.
It all depends on the frequency of the user’s social network activities, because the amount of user information available on the web is directly proportional to the measure of the user’s activities. Therefore, to avoid injustice to an inactive user, some additional information must be linked with the user’s account for the construction of a UIP. This additional information is not something new; it is already available around the user, but in a latent form, i.e., not explicitly stated. The $G_p$ and $C_p$ protocols are used to infer the user’s additional information from his real-world society and implicit tag relationships respectively, using various pattern recognition and learning algorithms. The information predicted by $G_p$ and $C_p$ contributes to the indirect interest of a user.
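As a concrete illustration of the Base protocol weighting in Eq. (1) of Section 3.1, the following minimal Python sketch derives a Normalized Tag Frequency profile from a list of ternary relations. The function names `ntf` and `build_base_profile`, the tuple layout `(user, tag, resource)`, and the toy data are hypothetical and not taken from the paper. The sketch also assumes that the page count of a user is the number of distinct resources the user has annotated, which is one possible reading of the User-Item row sum described above.

```python
from collections import Counter, defaultdict


def ntf(tag_count, page_count):
    """Eq. (1): degree of interest = (records using the tag) / (pages annotated)."""
    if page_count <= 0:
        raise ValueError("user has no annotated pages")
    return tag_count / page_count


def build_base_profile(ternary_relations):
    """Return {user: {tag: degree_of_interest}} from (user, tag, resource) triples.

    Assumption (not confirmed by the paper): each distinct resource counts
    once towards the user's page count.
    """
    tag_counts = defaultdict(Counter)  # per user: how many records use each tag
    pages = defaultdict(set)           # per user: distinct annotated resources
    for user, tag, resource in ternary_relations:
        tag_counts[user][tag] += 1
        pages[user].add(resource)
    return {
        user: {tag: ntf(count, len(pages[user])) for tag, count in counts.items()}
        for user, counts in tag_counts.items()
    }


if __name__ == "__main__":
    # Counts quoted in the text for user Abu and the tag Mathematics.
    print(round(ntf(19, 122), 4))

    # Hypothetical toy relations, for illustration only.
    relations = [
        ("abu", "math", "w1"),
        ("abu", "math", "w2"),
        ("abu", "java", "w2"),
        ("ravi", "java", "w3"),
    ]
    print(build_base_profile(relations))
```

With the counts quoted for Abu (19 and 122), the ratio is roughly 0.156, which the text reports as 0.15. Dividing by the per-user page count rather than using the raw tag frequency is what keeps the weights of very active and rarely active users on a comparable scale.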
Fig. 1. Protocol based architecture to construct the UIP.
3.2.1. Society relationship network
Predicting good-quality additional information for UIP augmentation is impossible without considering society relationships. It is a universally accepted fact that the behavior of a person is strongly influenced by the society he keeps, and any change in a person's behavior has a direct impact on his likes and dislikes. The society of a person comprises friends, neighbors, colleagues, relatives, or any other person who is directly or indirectly related to the person under consideration. Thus, it is essential to analyze this relationship network in order to unearth information for augmenting a user's profile. In the physical world, every person has different types of social relationships with various persons in his society; some of them are taken into full confidence, while others remain mere acquaintances. The people in the first category are considered more trustworthy than the others. Therefore, a person generally likes to share everything with the people in his inner private circle and also takes their advice. Considering this fact, society relationship types and the Trust Matrix (TM) have been fused together into Gp for predicting additional information. A simplified view of the society relationships and the corresponding trust matrix for the whole dataset is presented in Fig. 3, as the entire view of the network is quite complicated. Formally, TM is defined as:
Definition 4. A Trust Matrix for user set U describes the extent of trust one user has over another. An adjacency matrix is used to denote the Trust Matrix (TM):

$$TM_{i,j} = Tr_{i,j}$$

where $TM_{i,j}$ depicts the extent of trust $u_i$ has over $u_j$, which is measured by the trust score $Tr_{i,j}$. The magnitude of $Tr_{i,j}$ has been calculated using Eq. (2):

$$Tr_{i,j} = \begin{cases} 1, & \exists\, u_j \in IPC_{u_i} \\ \dfrac{1}{|IPC_{u_i}|} * |u_x|, & \exists\, u_x \in IPC_{u_i} \wedge \exists\, u_y \in IPC_{u_j}\ (u_x = u_y) \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $IPC_{u_j}$ and $IPC_{u_i}$ represent the sets of users in the inner private circles of users $u_j$ and $u_i$ respectively. In the Gp protocol, the trust score has been calculated only for those users who are members of the first two domains in the society relationship network of the user under consideration. $IPC_{u_i}$ constitutes the direct social relatives of $u_i$, who all have the same extent of trust, i.e., equal to 1.

Definition 5. Let $\{t_1^{gp_i}, t_2^{gp_i}, t_3^{gp_i}, \ldots, t_n^{gp_i}\}$ and $\{\omega_1^i, \omega_2^i, \omega_3^i, \ldots, \omega_n^i\}$ be the set of tags recommended to user i by the Gp protocol and the corresponding degrees of interest that user i may hold for those tags, respectively. For target user i, the UIP vector obtained by the Guild Protocol (Gp) is represented by $\vec{U}_i^{G_p}$ as:

$$\vec{U}_i^{G_p} = (t_1^{gp_i} : \omega_1^i,\ t_2^{gp_i} : \omega_2^i,\ \ldots,\ t_n^{gp_i} : \omega_n^i)$$
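The trust score of Eq. (2) can be sketched in Python as follows. This is only an illustration under a stated assumption: $|u_x|$ is read here as the number of users shared by the inner private circles of $u_i$ and $u_j$, which the text does not state explicitly. The names `trust_score`, `trust_matrix`, and `toy_ipc`, as well as the toy circles, are hypothetical.

```python
def trust_score(i, j, ipc):
    """Trust that user i has over user j, following the three cases of Eq. (2).

    ipc maps each user to the set of users in that user's inner private circle.
    Assumption: |u_x| is the number of users present in both circles.
    """
    circle_i = ipc.get(i, set())
    if j in circle_i:
        return 1.0                          # direct social relative of i
    shared = circle_i & ipc.get(j, set())   # users u_x = u_y found in both circles
    if shared:
        return len(shared) / len(circle_i)
    return 0.0


def trust_matrix(users, ipc):
    """Adjacency-style Trust Matrix as a nested dict: TM[i][j] = Tr_{i,j}."""
    return {i: {j: trust_score(i, j, ipc) for j in users if j != i} for i in users}


# Hypothetical inner private circles.
toy_ipc = {
    "ajay": {"abu", "geeta", "ravi", "meena", "john"},
    "sita": {"abu", "geeta", "ravi", "meena"},
}
toy_tm = trust_matrix(["ajay", "abu", "sita", "kiran"], toy_ipc)
```

Note that the score is not symmetric under this reading, because it is normalized by the size of the inner private circle of $u_i$ only.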
Fig. 2. Illustration of UIP constructed using base protocol.
In Definition 5, n depicts the cumulative count of tags recommended to user i by the Gp protocol, and $\omega_j^i$ is the degree of interest that user i may hold for tag $t_j^{gp_i}$. Algorithm 1 has been used by Gp to make tag recommendations to $u_i$ and to assign a degree of interest to the tags. For more details, refer to Algorithm 1.
Hypothesis 1: Given the results of various parameters, is the partial UIP constructed from Bp and Gp protocol information more efficient than the UIPs corresponding to the baseline methodologies?
$H_0 : E_{(B_p \cup G_p)} \geq E_{baselines}$
where $H_0$ is the null hypothesis, $E_{(B_p \cup G_p)}$ is the efficiency of the partial UIP constructed using information generated by the Bp and Gp protocols, and $E_{baselines}$ is the efficiency of the UIP constructed using the baselines.
An example of Gp-based UIP construction is given in Fig. 3, where a part of the social relationship network of user Ajay has been considered. The Trust Matrix is constructed from Eq. (2), and the Weighted User-Tag Matrix is obtained from the Bp protocol as shown in Fig. 2. Here, Algorithm 1 has been used to make tag recommendations with a suitable degree of interest to Ajay. Firstly, the users having a trust score equal to or greater than 0.8 have been selected from the social relatives of Ajay; then the top 10 tags from the sorted Bp-based UIPs of the selected users have been chosen for recommendation. The same tag may be recommended to Ajay by several of his social relatives. Therefore, before making any recommendation, records corresponding to duplicate tags are removed, keeping only the record with the highest degree of interest. After removing the duplicate records, the User-Tag matrix is updated for Ajay using the additional information recommended by Gp and the information from Bp, as shown in Fig. 3. Gp-based tag recommendations can be made similarly for other users, but only Ajay is considered here to keep the visualization simple and to maintain consistency between all examples. To describe the example, different color codes and shapes have been used: a brown circle shows the degree of interest in recommended tags; a red circle and arc represent the trust score; and a green oval and arc stand for the Bp-based UIPs of the selected social relatives. Each row i of the updated User-Tag matrix corresponds to the partial UIP based on Bp and Gp for a user $u_i$. For details regarding the process of calculating cell values in the updated User-Tag matrix, refer to Algorithm 1.
3.2.2. Tag relationship network
Mining additional information only on the basis of tags recommended by the social relatives of a user is also not sufficient for the augmentation of his UIP. A UIP constructed from this additional information and the user-generated tag information can represent the user's interest to some extent, but it still remains incomplete. Therefore, some supplementary information about the user is still required to further enlarge his preference boundaries. The analysis of real-world tag relationships can unearth various hidden facts about a pair of tags and can prove to be a valuable
Fig. 3. Illustration of partial UIP constructed using base and guild protocol.
input for UIP augmentation. Therefore, these relationships have been incorporated in the proposed model and are identified by the Cp protocol. Firstly, a Tag-Tag Relationship matrix ($TTR_m$) is constructed in which each cell measures the level of semantic relatedness between the corresponding tags. Secondly, after computing $TTR_m$, clustering is performed on its row vectors. Semantic relatedness can capture any type of relationship between a pair of tags, whereas more primitive measures such as syntactic similarity, co-occurrence, and semantic similarity can each capture only one type of relationship. For example, semantic similarity exists between mathematics and physics, but the tags mathematics and books are semantically related. The possibility that a person who likes mathematics may also like physics is much lower than the possibility of a person liking both mathematics and books. Therefore, Cp can also handle tags which are neither semantically nor syntactically similar, but are semantically related to each other in the real world. The preliminary requirement for the computation of $TTR_m$ is the set of tag vectors, which are obtained with the help of the Word2vec model developed by Mikolov, Yih, and Zweig (2013). Word2vec is a composition of several inter-related neural-network models, which cooperatively produce multi-dimensional word-vectors for every distinct word in a large corpus of text. Among predictive modeling methods, Word2vec is computationally more efficient than its peers, and for this reason it has been used in Cp. The word-vectors corresponding to every tag in the collaborative tagging dataset under consideration were selected from the collection of word-vectors created by the Word2vec model. Before making any selection, however, all the tags were stemmed in order to avoid different vector representations of two similar tags. For example, books and book are two similar tags in the real world, but they are considered two different tags if stemming is not performed. In Cp, the cosine similarity between the vector representations of two tags is used to compute the level of semantic relatedness between those tags using Eq. (3):
$$TTR_m(t_i, t_j) = \frac{\vec{t_i} \cdot \vec{t_j}}{|\vec{t_i}| * |\vec{t_j}|} \tag{3}$$
where $\vec{t_i}$ and $\vec{t_j}$ are the vector representations of the tags $t_i$ and $t_j$ respectively. The greater the cosine similarity between $\vec{t_i}$ and $\vec{t_j}$, the higher the real-world semantic relatedness between $t_i$ and $t_j$. After computing the Tag-Tag Relationship matrix, the tags in the collaborative tagging dataset are clustered using the row vectors corresponding to those tags in $TTR_m$. A cluster comprises various tags that are associated with each other in some context, but the strength of association varies from one tag pair to another. Moreover, a cluster may be a composition of multiple related contexts. Out of several eminent clustering algorithms, Hierarchical Agglomerative Clustering (HAC) is incorporated into Cp. The key factors behind its incorporation are its ability to accommodate a large number of unevenly sized clusters and a voluminous amount of data without any degradation of scalability and efficiency. The additional information identified by Cp using $TTR_m$ and the tag clusters is quite advantageous for the augmentation of a user's UIP.
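The two steps described above, building $TTR_m$ with Eq. (3) and clustering its row vectors, can be sketched in plain Python as follows. This is an illustrative sketch only: the text does not state the linkage criterion or the row distance used by HAC, so average linkage and Euclidean distance are assumptions made here, and the two-dimensional vectors are made up rather than real Word2vec output. All function and variable names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Eq. (3): dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tag_tag_matrix(tags, vectors):
    """Build TTRm as a list of rows, one row per tag, in the order of `tags`."""
    return [[cosine_similarity(vectors[a], vectors[b]) for b in tags] for a in tags]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerative_clusters(rows, n_clusters):
    """Average-linkage HAC over row vectors; returns clusters as lists of row indices."""
    clusters = [[k] for k in range(len(rows))]
    while len(clusters) > max(n_clusters, 1):
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(euclidean(rows[p], rows[q])
                            for p in clusters[a] for q in clusters[b])
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return clusters


# Made-up 2-D vectors standing in for real Word2vec output.
toy_tags = ["mathematics", "book", "music", "guitar"]
toy_vectors = {
    "mathematics": [1.0, 0.1],
    "book": [0.9, 0.3],
    "music": [0.0, 1.0],
    "guitar": [0.1, 0.9],
}
ttr_m = tag_tag_matrix(toy_tags, toy_vectors)
toy_clusters = agglomerative_clusters(ttr_m, n_clusters=2)
```

The naive pairwise search is cubic in the number of tags and is meant only to show the idea; for a vocabulary of thousands of tags a library implementation of hierarchical clustering would be the practical choice.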
Definition 6. Let $\{t_1^{cp_i}, t_2^{cp_i}, t_3^{cp_i}, \ldots, t_n^{cp_i}\}$ and $\{\rho_1^i, \rho_2^i, \rho_3^i, \ldots, \rho_n^i\}$ be the set of tags recommended to user i by the Cp protocol and the corresponding degrees of interest that user i may hold for those tags, respectively. For target user i, the UIP vector obtained by the Congregation protocol (Cp) is represented by $\vec{U}_i^{C_p}$ as:

$$\vec{U}_i^{C_p} = (t_1^{cp_i} : \rho_1^i,\ t_2^{cp_i} : \rho_2^i,\ \ldots,\ t_n^{cp_i} : \rho_n^i)$$

Algorithm 1: Guild protocol (Gp) recommended UIP.
Input: Trust Matrix TM, user set U and Bp protocol based UIP U_i^Bp
Output: Tags recommended by Gp and their corresponding degree of interest
Initialize: index1, index2 ← 1    // index1 and index2 are index pointers
Caller_Bp ← U_i^Bp
for m ← 1 to ncols(Caller_Bp) do    // ncols() counts the number of columns in a matrix
    Caller_tags ← Caller_tags.append(Caller_Bp(1, m))
for each user j in U do
    if TM(i, j) ≠ 0 then    // trust score must not be equal to zero
        Relative_Tscore(1, index1) ← j
        Relative_Tscore(2, index1) ← TM(i, j)
        index1 ← index1 + 1
Relative_Tscore ← sort(Relative_Tscore)    // sort w.r.t. trust score in decreasing order
for k ← 1 to ncols(Relative_Tscore) do
    if Relative_Tscore(2, k) ≥ 0.8 then
        user_relative ← Relative_Tscore(1, k)
        Bp_user ← fetch Bp based UIP of user user_relative
        for l ← 1 to ncols(Bp_user) do
            if (Bp_user(1, l) NOT IN Caller_tags) AND (l ≤ 10) then
                Caller_Gp(1, index2) ← Bp_user(1, l)
                degree_of_interest ← Relative_Tscore(2, k) ∗ Bp_user(2, l)
                Caller_Gp(2, index2) ← degree_of_interest
                index2 ← index2 + 1
Caller_Gp ← sort(Caller_Gp)    // sort w.r.t. degree_of_interest in decreasing order
temp(1, 1) ← Caller_Gp(1, 1)
temp(2, 1) ← Caller_Gp(2, 1)
index1 ← 2
for r ← 2 to ncols(Caller_Gp) do
    flag ← 0
    for z ← ncols(temp) to 1 do
        if Caller_Gp(1, r) == temp(1, z) then
            flag ← 1
            break
    if flag == 0 then
        temp(1, index1) ← Caller_Gp(1, r)
        temp(2, index1) ← Caller_Gp(2, r)
        index1 ← index1 + 1
Caller_Gp ← temp
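The steps of Algorithm 1 (Guild protocol) can be sketched compactly in Python. This is a hedged illustration rather than the authors' implementation: profiles are represented as dictionaries instead of two-row matrices, and the final duplicate-removal pass is folded into a "keep the highest weight" update, which yields the same result as sorting and keeping the first occurrence. The names `guild_recommend`, `toy_trust`, and `toy_bp`, and all toy values, are hypothetical.

```python
def guild_recommend(i, trust, bp_profiles, threshold=0.8, top_n=10):
    """Return {tag: degree_of_interest} recommended to user i, sorted decreasingly.

    trust[i][j] is the trust score of Eq. (2); bp_profiles[j] is the Bp-based
    profile {tag: theta} of user j.
    """
    own_tags = set(bp_profiles.get(i, {}))
    recommended = {}
    for j, score in trust.get(i, {}).items():
        if j == i or score < threshold:
            continue  # only sufficiently trusted social relatives contribute
        # Relative's Bp profile, strongest interest first; only the top_n positions count.
        ranked = sorted(bp_profiles.get(j, {}).items(), key=lambda kv: kv[1], reverse=True)
        for tag, theta in ranked[:top_n]:
            if tag in own_tags:
                continue  # the caller already holds this tag
            weight = score * theta
            # Duplicate tags: keep only the record with the highest degree of interest.
            if weight > recommended.get(tag, 0.0):
                recommended[tag] = weight
    return dict(sorted(recommended.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical trust scores and Bp profiles.
toy_trust = {"ajay": {"abu": 1.0, "geeta": 0.8, "ravi": 0.5}}
toy_bp = {
    "ajay": {"mathematics": 0.4},
    "abu": {"mathematics": 0.5, "physics": 0.3},
    "geeta": {"physics": 0.5, "art": 0.2},
    "ravi": {"cooking": 0.9},
}
toy_gp = guild_recommend("ajay", toy_trust, toy_bp)
```

In this toy run the tag physics is offered by two relatives and only the larger weighted value survives, while the relative with a trust score below 0.8 contributes nothing.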
In Definition 6, n depicts the cumulative count of tags recommended to user i by the Cp protocol, and $\rho_j^i$ is the degree of interest that user i may hold for tag $t_j^{cp_i}$. Algorithm 2 has been used by Cp to make tag recommendations to $u_i$ and to assign a degree of interest to the tags. For more details, refer to Algorithm 2.
Hypothesis 2: Given the results of various parameters, is the partial UIP constructed from Bp and Cp protocol information more efficient than the UIPs corresponding to the baseline methodologies?
$H_0 : E_{(B_p \cup C_p)} \geq E_{baselines}$
where $H_0$ is the null hypothesis, $E_{(B_p \cup C_p)}$ is the efficiency of the partial UIP constructed using information generated by the Bp and Cp protocols, and $E_{baselines}$ is the efficiency of the UIP constructed using the baselines.
An example of Cp-based UIP construction is given in Fig. 4, where a part of the actual tags dataset is considered. The Tag-Tag Relationship Matrix $TTR_m$ has been constructed from Eq. (3) using the vector representations of tags generated by the Word2vec model, whereas the cluster set Sclus has been obtained from $TTR_m$ and HAC. The Weighted User-Tag Matrix has been obtained from the Bp protocol as shown in Fig. 2. Here, Algorithm 2 has been used to make tag recommendations with a suitable degree of interest to Ajay. Firstly, the clusters containing tags generated by Ajay have been identified, here clus1, and the tags of clus1 that are not in Ajay's profile are collected in a list, i.e., School and Book. Then, the semantic relatedness of each listed tag with each tag of Ajay has been measured using $TTR_m$. Of the potential candidates, School and Book, only Book is recommended to Ajay. The User-Tag matrix is updated for Ajay using the additional information recommended by Cp and the information from Bp, as shown in Fig. 4. Cp-based tag recommendations can be made similarly for other users. To describe the example, different color codes and shapes have been used: a brown circle shows the degree of interest in recommended tags; a red circle and arc show the contribution of $TTR_m$ and Sclus; and a green oval and arc stand for the Bp-based UIP of the user to whom recommendations are made. Each row i of the updated User-Tag matrix corresponds to the partial UIP based on Bp and Cp for a user $u_i$. For the process of calculating cell values in the updated User-Tag matrix and the selection of clusters, refer to Algorithm 2.
Lastly, the additional information about the user generated by the Gp and Cp protocols is combined with that provided by Bp, with the help of the Ftrade function, for the augmentation and construction of the user's UIP. The information takes the form of tag recommendations and the corresponding degrees of interest that the user may hold for those tags. The full-fledged final UIP of user $u_i$ has been constructed by Ftrade using Eq. (4).

Algorithm 2: Congregation protocol (Cp) recommended UIP.
Input: Set of clusters Sclus, Tag-Tag Relationship matrix TTRm and Bp protocol based UIP U_i^Bp
Output: Tags recommended by Cp and their corresponding degree of interest
Initialize: index1, index2 ← 1    // index1 and index2 are index pointers
for each clus in Sclus do    // clus is a cluster in Sclus
    index1 ← 1
    Caller_Bp ← U_i^Bp
    for k ← 1 to ncols(Caller_Bp) do    // ncols() counts the number of columns in a matrix
        user_tag ← Caller_Bp(1, k)
        for each tag j in clus do
            if user_tag == j then
                tag_list ← tag_list.append(j)    // list of user's tags present in clus
    for each tag j in clus do
        total_rel ← 0
        for each tag l in tag_list do
            if j == l then
                total_wt ← total_wt + Caller_Bp(2, l)    // total weight of user's tags present in clus
                break
            else
                total_rel ← total_rel + TTRm(l, j)    // total semantic relatedness between cluster tag and user's tags
        avg_rel = total_rel / length(tag_list)
        temp_profile(1, index1) ← j
        temp_profile(2, index1) ← avg_rel
        index1 ← index1 + 1
    temp_profile ← sort(temp_profile)    // sort w.r.t. avg_rel in decreasing order
    avg_tagwt ← total_wt / length(tag_list)
    tag_rec ← length(tag_list) / 2    // number of tags recommended by clus
    for m ← 1 to tag_rec do
        Caller_Cp(1, index2) ← temp_profile(1, m)    // tag recommended by clus to user
        degree_of_interest ← temp_profile(2, m) ∗ avg_tagwt
        Caller_Cp(2, index2) ← degree_of_interest
        index2 ← index2 + 1
    Empty(tag_list)    // delete all elements of tag_list
    total_wt, total_rel ← 0
Caller_Cp = sort(Caller_Cp)    // sort w.r.t. degree_of_interest in decreasing order
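Algorithm 2 (Congregation protocol) can likewise be sketched in Python. The sketch rests on two interpretations that the pseudocode leaves open and that should be treated as assumptions: tags the user already holds are excluded from the candidates, as in the Ajay example where only School and Book are considered, and the number of recommendations per cluster, length(tag_list) / 2, is taken as integer division. The name `congregation_recommend` and all toy values are hypothetical.

```python
def congregation_recommend(bp_profile, clusters, ttr):
    """Return {tag: degree_of_interest} recommended by the Cp steps, sorted decreasingly.

    bp_profile: {tag: theta} for the target user; clusters: list of tag lists;
    ttr[a][b]: semantic relatedness TTRm(a, b) between tags a and b.
    """
    recommended = {}
    for cluster in clusters:
        user_tags = [t for t in cluster if t in bp_profile]  # user's tags present in the cluster
        if not user_tags:
            continue
        avg_weight = sum(bp_profile[t] for t in user_tags) / len(user_tags)
        candidates = []
        for tag in cluster:
            if tag in bp_profile:
                continue  # assumption: tags already in the profile are not candidates
            avg_rel = sum(ttr[own][tag] for own in user_tags) / len(user_tags)
            candidates.append((tag, avg_rel))
        candidates.sort(key=lambda kv: kv[1], reverse=True)
        n_rec = len(user_tags) // 2  # assumption: integer division
        for tag, avg_rel in candidates[:n_rec]:
            recommended[tag] = avg_rel * avg_weight
    return dict(sorted(recommended.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical profile, clusters, and relatedness values.
toy_user_bp = {"mathematics": 0.4, "physics": 0.2}
toy_tag_clusters = [["mathematics", "physics", "school", "book"], ["music", "guitar"]]
toy_ttr = {
    "mathematics": {"school": 0.5, "book": 0.8},
    "physics": {"school": 0.3, "book": 0.6},
}
toy_cp = congregation_recommend(toy_user_bp, toy_tag_clusters, toy_ttr)
```

With two of the user's tags in the first cluster, one candidate is recommended, and it is the one with the higher average relatedness, mirroring the School versus Book outcome of the example with made-up numbers.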
$$U_{i,j} = \begin{cases} \theta_j^{u_i}, & \exists\, t_j \in \vec{U}_i^{B_p} \\ \omega_j^{u_i}, & (\exists\, t_j \in \vec{U}_i^{G_p} \wedge \exists\, t_k \in \vec{U}_i^{C_p}\ (t_j = t_k)) \vee (\exists\, t_j \in \vec{U}_i^{G_p} \wedge t_j \notin \vec{U}_i^{C_p}) \\ \rho_j^{u_i}, & \exists\, t_j \in \vec{U}_i^{C_p} \wedge t_j \notin \vec{U}_i^{G_p} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where $U_{i,j}$ depicts the cell value of the User-Tag matrix corresponding to the full-fledged final UIP for user $u_i$ and tag $t_j$. $\vec{U}_i^{B_p}$, $\vec{U}_i^{G_p}$ and $\vec{U}_i^{C_p}$ are the UIP vectors corresponding to the information provided by the Bp, Gp and Cp protocols respectively. Here, $\theta_j^{u_i}$, $\omega_j^{u_i}$ and $\rho_j^{u_i}$ are the degrees of interest that user $u_i$ has in the tag $t_j$ as predicted by the Bp, Gp and Cp protocols respectively. For $u_i$, all the tags in the User-Tag matrix whose degree of interest is non-zero, together with their respective degrees of interest, constitute the full-fledged final UIP of user $u_i$, i.e., $\vec{U}_i$.
Hypothesis 3: Given the results of various parameters, is the full-fledged UIP constructed from Bp, Gp and Cp protocol information more efficient than the UIPs corresponding to the baseline methodologies and the partial UIPs?
$H_0 : E_{(B_p \cup G_p \cup C_p)} > E_{(B_p \cup G_p)}$ and $E_{(B_p \cup G_p \cup C_p)} > E_{(B_p \cup C_p)}$ and $E_{(B_p \cup C_p)} \geq E_{baselines}$
where $H_0$ is the null hypothesis. $E_{(B_p \cup G_p \cup C_p)}$ is the efficiency of the full-fledged final UIP constructed using information generated by the Bp, Gp and Cp protocols. $E_{(B_p \cup G_p)}$ is the efficiency of the partial UIP constructed using information generated by the Bp and Gp protocols. $E_{(B_p \cup C_p)}$ is the efficiency of the partial UIP constructed using information generated by the Bp and Cp protocols.
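The case analysis of Eq. (4) amounts to a precedence rule between the three protocols, which can be sketched in Python as follows. This is a minimal illustration in which the three UIP vectors are represented as dictionaries; the function name `f_trade` is borrowed from the Ftrade function of the text, but the code itself and the toy profiles are hypothetical.

```python
def f_trade(bp, gp, cp):
    """Merge the Bp, Gp and Cp profiles following the case order of Eq. (4).

    Each argument is a {tag: degree_of_interest} dictionary. A tag held by the
    user (Bp) keeps theta; otherwise a Gp recommendation wins, whether or not
    Cp also recommends the tag; otherwise the Cp value is used.
    """
    final = {}
    for tag in set(bp) | set(gp) | set(cp):
        if tag in bp:
            final[tag] = bp[tag]   # theta: direct interest
        elif tag in gp:
            final[tag] = gp[tag]   # omega: guild protocol
        else:
            final[tag] = cp[tag]   # rho: congregation protocol only
    # Only tags with a non-zero degree of interest form the final UIP.
    return {tag: weight for tag, weight in final.items() if weight != 0}


# Hypothetical partial profiles.
toy_final = f_trade(
    bp={"mathematics": 0.4},
    gp={"physics": 0.4, "art": 0.16},
    cp={"book": 0.21, "physics": 0.1},
)
```

When Gp and Cp both recommend a tag, the Gp value is kept, which reflects the second case of Eq. (4).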
Fig. 4. Illustration of partial UIP constructed using base and congregation protocol.
As in Hypotheses 1 and 2, $E_{baselines}$ is the efficiency of the UIP constructed using the baselines.
4. Experimental setup
To demonstrate the effectiveness of the proposed model, i.e., the Society-Tag Relationship Protocol based Architecture for UIP Construction, extensive experiments were performed on a large dataset. This section provides a detailed description of the dataset used, the evaluation metrics, and the baseline methodologies with which the comparison is made. All the algorithms of the proposed model were implemented in the Python 2.7.10 shell. The hardware used for training and testing was a Dell Workstation T5600 with an Intel Xeon E5 CPU and 8 GB RAM running Windows.
A collaborative tagging dataset obtained from del.icio.us has been used as the experimental data. Del.icio.us is a very popular social bookmarking service where a user can tag or annotate any web page. This dataset consists of approximately 1867 users, 69,226 URLs corresponding to different web pages, and 53,388 tags (10,358 unique tags). There are nearly 437,593 ternary relations, i.e., cross-product of users, tags and web pages. On average, approximately 24 tags have been used by every user to annotate 65 web pages. Before conducting the experiments, the dataset was pre-processed, as it contained some noisy elements which could hinder the performance of the proposed model. The major pre-processing steps were: (1) tag tokenization, as some tags were just a collection of multiple words concatenated using special symbols, e.g., "good@mathematics#books"; (2) tag stemming using Porter's algorithm; (3) discarding the tags which were only stop-words, too personal (such as "exciting"), or meaningless; and (4) finally, discarding all non-English tags. Several Python scripts based on regular expressions were created to pre-process the dataset.
The del.icio.us dataset has been used by the Bp and Gp protocols to extract information about the user's own tags and society relationships respectively, which is then used to generate refined information about the user's interest. The working of Cp, however, requires the Tag-Tag Relationship matrix $TTR_m$, which has been computed using the vector representations of tags. A Word2vec model trained on an English Wikipedia dataset has been used to construct a giant corpus of word-vectors. Wikimedia (2017) had thoroughly analyzed and described the dataset, and also made it freely available to the public for research. Before model training, the dataset had to be converted from XML to text format. As an output of Word2vec model training, a vocabulary of approximately 880,802 words and their respective word-vectors has been obtained.
For the evaluation of the proposed model, the del.icio.us dataset has been randomly split into two parts, i.e., a training set and a testing set, with 5-fold cross-validation. The training set constitutes 80% of the records of the dataset, whereas the testing set constitutes the remaining 20%. The original user-tag data has been divided separately for each user. The training set has been used to construct a user's UIP on the basis of the proposed model and the various baseline methodologies, whereas the testing set evaluates the efficiency of the obtained UIPs. The baseline methodologies selected for comparison with the proposed model and the evaluation parameters are discussed in the subsequent subsections.
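The tokenization and filtering steps of the pre-processing described above can be sketched with regular expressions as follows. This is a hedged illustration, not the authors' scripts: the stop-word list is a tiny placeholder, treating non-ASCII tokens as non-English is a crude assumption made here for brevity, and Porter stemming is left out because it is not part of the Python standard library. The names `tokenize_tag` and `STOPWORDS` are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}


def tokenize_tag(raw_tag):
    """Split a concatenated tag on special symbols and filter the resulting tokens.

    Tokens are lower-cased; empty tokens, stop-words, and non-ASCII tokens
    (used here as a rough proxy for non-English tags) are discarded.
    """
    tokens = re.split(r"[\W_]+", raw_tag.lower())
    return [
        token
        for token in tokens
        if token and token.isascii() and token not in STOPWORDS
    ]


toy_tokens = tokenize_tag("good@mathematics#books")
```

A stemming step would then map books and book to the same stem before the word-vectors are looked up.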
Table 2 Metrics for experimental evaluation of user interest profile.
| Metric | Mathematical formulation | Description |
| --- | --- | --- |
| Precision@K (P@K) | $\frac{1}{n}\sum_{i=1}^{n} loc(t_i)$ | where $loc(t_i)$ depicts the location of target tag $t_i$ in the user's UIP, which is pre-sorted in an ascending order w.r.t. degree of interest. If $loc(t_i)$ is among the topmost K tags of user interest, then $loc(t_i)$ is equal to 1, otherwise 0. |
| Mean Reciprocal Rank (MRR) | $\frac{1}{n}\sum_{i=1}^{n} \frac{1}{loc(t_i)}$ | where $loc(t_i)$ depicts the location of target tag $t_i$ in the user's UIP, which is pre-sorted in an ascending order w.r.t. degree of interest. |
| Relative Improvement in Location (RIL) | $\frac{1}{n}\sum_{i=1}^{n} \left( \frac{1}{loc(t_i)_p} - \frac{1}{loc(t_i)_s} \right)$ | where $loc(t_i)_p$ and $loc(t_i)_s$ depict the location of tag $t_i$ in the user's UIP constructed by the proposed model and the baseline methodology respectively. |
| Completeness (comp) | $\frac{1}{n} \lvert T_{test} \cap T_{UIP} \rvert$ | where $T_{test}$ and $T_{UIP}$ depict the tags in the testing set and the set formed by the tags in the user's UIP respectively. |

4.1. Evaluation metrics
Four metrics widely accepted by the information retrieval community have been employed in the experiments conducted for this study to evaluate the efficiency and effectiveness of the proposed model. Each metric, i.e., P@K (Wu et al., 2015), MRR (Xie et al., 2014), RIL (Shepitsen et al., 2008) and comp, analyzes a different aspect of the UIP constructed by the proposed model, but their ultimate objective remains the same. For the mathematical formulation and description of the selected evaluation metrics, refer to Table 2. In each metric, n refers to the total count of tags in the testing set, and $t_i$ represents a target tag of the testing set.
4.2. Baseline methodologies
In the experiments, four baseline methodologies of UIP construction have been selected for comparison in order to validate the effectiveness of the proposed model. The first methodology, denoted by BM1, is given by Bouadjenek et al. (2016), where user-term-frequency and inverse-document-frequency were used to measure the tag weights. The second methodology, denoted by BM2, is a personalization methodology for web search designed by Cai et al. (2014), where the weights of tags are normalized term frequency values. The third methodology, denoted by BM3, is presented by Kumar et al. (2014) for the construction of a UIP using the user's own tags and its augmentation. It is thus an aggregation of user information predicted by term-frequency and inverse-document-frequency values and clusters of semantically similar tags. Out of the two UIPs designed by Kumar et al. (2014), i.e., svdCUIP and modSvdCUIP, modSvdCUIP has been selected here for comparison as it is more efficient than its sibling. The last methodology, denoted by BM4, has been proposed by Shafiq et al. (2015), where not only the tags used by a user himself, but also the tags recommended by the latent friendship circle of the user are used for constructing a UIP.
In both BM3 and BM4, a single UIP augmentation strategy has been employed, whereas BM1 and BM2 have no augmentation at all. In order to separately analyze the impact of the additional information predicted by the Gp and Cp protocols of the proposed model, two partial UIPs have also been constructed. The first partial UIP combines the information from the Bp and Gp protocols, whereas the second partial UIP combines Bp and Cp. The first partial UIP is denoted by Bp plus Gp and the second by Bp plus Cp, whereas the full-fledged final UIP is denoted by Bp plus Gp and Cp.
5. Results and discussion
The results of the experiments conducted to compare the performance of the proposed model with the various baseline methodologies are discussed in this section. In all, 35 clusters have been prepared to implement the clustering-based strategy, both in the proposed model and in the baseline methodology, for predicting additional information in order to augment the UIP of a user. As per Provalis (2017), there are 44 categories in WordNet, but the categories corresponding to words which are too personal or define some type of feeling have been removed. The metrics described in Table 2 have been used to evaluate and quantify the comparison.
The polar-radar graph shown in Fig. 5 illustrates the comparison of the UIPs constructed using the proposed model and the baseline methodologies based on precision, i.e., the P@K evaluation metric. It is an important and commonly used metric for measuring the accuracy of the obtained UIP, and it highlights the implications of keeping a higher percentage of the user's favourite tags in the list of tags in his UIP. Different color codes for each baseline, the partial UIPs and the full-fledged final UIP have been used to visualize the results so that even a small difference in values can be noticed. The experiment has been repeated with a different value of K in each subsequent iteration. The results obtained in each iteration have been averaged over all the users in the dataset to give an average precision (inner radial axis), as it is not feasible to represent the results for every user. It can be clearly observed from the behavior of the radars in Fig. 5 that, for every value of K, all variants of the proposed model have outperformed each baseline methodology by a considerable margin. For K = 35, i.e., P@35, the partial UIP constructed from information predicted by Bp plus Cp outperforms the nearest baseline methodology, i.e., BM4, by 7.16%, whereas Bp plus Gp outperforms BM4 by 11.88%. In comparison to the least performer at P@35, i.e., BM1, Bp plus Cp has achieved a margin of 66.99%, whereas for Bp plus Gp the margin is 74.38%.
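For reference, the four metrics of Table 2 that underlie these comparisons can be sketched in Python as follows. The sketch makes assumptions that Table 2 does not spell out: a UIP is given as a list of tags already in the sorted order used by Table 2, so that $loc(t_i)$ is the 1-based position of the tag; a target tag missing from the UIP contributes 0 to every sum; and the target tags are distinct. All function names and toy lists are hypothetical.

```python
def locations(uip, targets):
    """1-based position of each target tag in the sorted UIP, or None if absent."""
    rank = {tag: pos for pos, tag in enumerate(uip, start=1)}
    return [rank.get(tag) for tag in targets]


def precision_at_k(uip, targets, k):
    """P@K: share of target tags found among the topmost K tags of the UIP."""
    hits = sum(1 for loc in locations(uip, targets) if loc is not None and loc <= k)
    return hits / len(targets)


def mean_reciprocal_rank(uip, targets):
    """MRR: mean of 1 / loc(t_i); absent tags contribute 0 (assumption)."""
    return sum(1 / loc for loc in locations(uip, targets) if loc is not None) / len(targets)


def relative_improvement(uip_proposed, uip_baseline, targets):
    """RIL: mean of 1 / loc_p - 1 / loc_s over the target tags."""
    total = 0.0
    for loc_p, loc_s in zip(locations(uip_proposed, targets), locations(uip_baseline, targets)):
        total += (1 / loc_p if loc_p else 0.0) - (1 / loc_s if loc_s else 0.0)
    return total / len(targets)


def completeness(uip, targets):
    """comp: share of target tags present anywhere in the UIP."""
    return len(set(targets) & set(uip)) / len(targets)


# Hypothetical sorted UIPs and testing-set tags for one user.
toy_uip = ["mathematics", "physics", "book", "art"]
toy_baseline_uip = ["book", "art", "physics"]
toy_targets = ["physics", "book", "music"]
```

In the experiments these per-user values are averaged over all users; the sketch covers a single user only.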
The full-fledged final UIP constructed from the combined information of Bp plus Gp and Cp surpassed the dominant baseline, i.e., BM4, by 32%, and the least performer by 105.74%. Even at K = 5, the full-fledged final UIP achieves a precision of 0.28693, which is 37.9% greater than that of the dominant baseline. For most values of K shown in Fig. 5, the precision increases from the inner angular axis to the outer angular axis. In the case of the full-fledged final UIP, all precision values lie towards the outer angular axis and are higher than those of the baselines and partial UIPs. Therefore, on the basis of the precision results, it can be said that the UIP constructed by the proposed model is more efficient. The greater the value of K, the higher the precision achieved by the proposed model. In Fig. 6, the performance achieved on the basis of MRR by the full-fledged final UIP and both partial UIPs of the proposed model has been compared with the UIPs constructed by the baseline methodologies. The figure clearly shows that the full-fledged final UIP constructed by Bp plus Gp and Cp has attained the highest MRR value, 0.21095, among all baseline and partial UIPs. It outperforms the dominant baseline, i.e., BM4, by 36.06%, and the least performing
Fig. 5. Comparative analysis of the proposed and baseline methodologies on the basis of average precision metric.
baseline by 95.08%. Even the partial UIPs corresponding to Bp plus Cp and Bp plus Gp have attained positive margins of 14.44% and 27.79% respectively over the dominant baseline. As in the case of the P@K metric, MRR has also been averaged over the total number of users in the dataset for the sake of simplicity. MRR indicates the presence of the most favourite tags of a user near the head of the tag list in his UIP. The greater the value of MRR, the higher the accuracy of a UIP. Fig. 7 shows the performance of the constructed UIPs based on the RIL evaluation metric. It shows a relative improvement of more than 25% in the locations assigned to the target tags inside the hierarchy of tags in the user's UIP by the full-fledged final UIP as compared to the baselines. The maximum and minimum RIL, observed against BM1 and BM4, are 37.88% and 25.3% respectively. The relative improvement of the full-fledged final UIP over the two partial UIPs has also been measured as 9.5% and 6.7% for Bp plus Cp and Bp plus Gp respectively. Fig. 8 compares the full-fledged final UIP and the partial UIPs constructed by the proposed model with the UIPs generated by the baseline methodologies on the basis of the comp metric. Basically, comp measures the target tags that belong to the testing set considered for the experiment and are also present among the most preferred tags of the user inside his UIP. The location of a tag inside the UIP does not affect the value of comp. This metric is quite important for quantifying the ability of a strategy to correctly predict the additional information for augmenting the UIP of a user. All the target tags that belong to the testing set are, in reality, the user's own tags, but due to the splitting between training and testing sets some of them move to the testing set without any instance left in the training set. So, this metric quantifies how many types of tags are correctly predicted by the UIP construction methodology into the finally obtained UIP. As with P@K, MRR and RIL, comp is also averaged over all the users in the dataset for the sake of simplicity. It can be clearly observed from Fig. 8 that the full-fledged final UIP given by the proposed model outperforms all the baselines by predicting 82.39% of the target tags and incorporating them into the final UIP. Even the partial UIPs, i.e., Bp plus Cp and Bp plus Gp, have correctly predicted 70.84% and 75.87% of the target tags respectively. The greater the percentage of predicted target tags, i.e., comp, the higher the probability of finding the user's interest tags in the UIP. If a methodology does not employ any UIP augmentation strategy, then it can predict only the tags which have a corresponding instance in the training set, and it remains the least performer, like BM1 and BM2.
The testing of the hypotheses discussed in Section 3 is performed on the basis of the results shown in Figs. 5–8, while Table 3 describes these results. The tick symbol (√) represents that the null hypothesis H0 is true, and the cross symbol (✗) means that the null hypothesis H0 is false. ENC means that the experiment has not been conducted to measure the efficiency of the partial UIP over the baseline methodology for a particular metric. NA simply depicts that the testing of the hypothesis to prove the efficiency of one partial UIP over another is not required. Based on the trends shown in Figs. 5–8 and the results of hypothesis testing in Table 3, certain facts can be derived. Firstly, the baselines BM1 and BM2, which have no strategy to identify additional information for the augmentation of a UIP, are always the least performers. The remaining baselines, i.e., BM3 and BM4, which are equipped with some augmentation strategy, perform better than BM1 and BM2, but their performance is not as good as that of the proposed model. The performance rating, based on experimental results of various evaluation metrics, of full-fledged final UIP
Fig. 6. Comparative analysis of the proposed and baseline methodologies on the basis of average MRR metric.
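The user-averaged MRR plotted in Fig. 6 can be illustrated with a small sketch. It assumes the standard definition of reciprocal rank (1/rank of the first target tag found in the ranked UIP, and 0 when none is found), which may differ in detail from the formulation used in this work. The function names and the toy users and tags are hypothetical.

```python
def reciprocal_rank(ranked_tags, target_tags):
    """Return 1/rank of the first target tag in the ranked UIP, or 0.0 if absent."""
    targets = set(target_tags)
    for position, tag in enumerate(ranked_tags, start=1):
        if tag in targets:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(uips, test_sets):
    """Average the reciprocal rank over every user that has a UIP."""
    if not uips:
        return 0.0
    total = sum(reciprocal_rank(uips[user], test_sets.get(user, ())) for user in uips)
    return total / len(uips)


# Toy data: hypothetical users and tags, not taken from the del.icio.us dataset.
uips = {
    "u1": ["python", "web", "search"],   # first target tag "web" sits at rank 2
    "u2": ["music", "travel", "food"],   # first target tag "food" sits at rank 3
}
test_sets = {"u1": {"web"}, "u2": {"food", "art"}}

average_mrr = mean_reciprocal_rank(uips, test_sets)
print(round(average_mrr, 4))
```

A UIP that places a target tag nearer the head of the list raises the per-user reciprocal rank and therefore the average.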
Table 3. Hypothesis testing of partial and final UIP.
Metric   UIP construction methodology   Hypothesis 1   Hypothesis 2   Hypothesis 3
P@K      BM1                            √              √              √
P@K      BM2                            √              √              √
P@K      BM3                            √              √              √
P@K      BM4                            √              √              √
P@K      Bp plus Cp                     NA             NA             √
P@K      Bp plus Gp                     NA             NA             √
MRR      BM1                            √              √              √
MRR      BM2                            √              √              √
MRR      BM3                            √              √              √
MRR      BM4                            √              √              √
MRR      Bp plus Cp                     NA             NA             √
MRR      Bp plus Gp                     NA             NA             √
RIL      BM1                            ENC            ENC            √
RIL      BM2                            ENC            ENC            √
RIL      BM3                            ENC            ENC            √
RIL      BM4                            ENC            ENC            √
RIL      Bp plus Cp                     NA             NA             √
RIL      Bp plus Gp                     NA             NA             √
Comp     BM1                            √              √              √
Comp     BM2                            √              √              √
Comp     BM3                            √              √              √
Comp     BM4                            √              ✗              √
Comp     Bp plus Cp                     NA             NA             √
Comp     Bp plus Gp                     NA             NA             √
Fig. 7. Percentage of average relative improvement in the location of tags in ranking hierarchy by the proposed model as compared to the baseline methodologies.
under the proposed model is higher than that of every baseline. Even the partial UIPs, i.e., Bp plus Gp and Bp plus Cp, perform better than the UIPs constructed by the various baselines. This supports the claim made in this paper that employing a UIP augmentation strategy is essential. Secondly, the relative contribution of the additional information about the user, as predicted by the Gp and Cp protocols, to the construction of the full-fledged final UIP cannot be clearly estimated, because on one occasion the Gp protocol contributes more than Cp and on another the reverse holds. However, the margin is small in both cases. Overall, the combined effect of the Gp and Cp protocols on the construction of an efficient full-fledged final UIP also supports the claim of this study that a single strategy is not sufficient for UIP augmentation. The prediction and learning algorithms developed as part of the proposed protocol based architecture for UIP construction offer several advantages, which are as follows:
• Firstly, the proposed model constructs a distinct multi-strategy based user interest profile for every user, listing their preferences with a suitable degree of interest in each preference. In contrast, other well-known UIP construction methodologies from the literature employ either no augmentation strategy or only a single one.
• Secondly, the approach followed in our work for incorporating the UIP augmentation strategies into the proposed model produces closer approximations of various real-world relationships, whereas the UIP produced by other methodologies is merely a trade-off between statistical tricks that have no significance in the real world.
• Thirdly, the dataset employed to learn the relationship of a tag pair is far bigger than the one used by other well-known methodologies.
• Fourthly, our proposed model is flexible and expandable, and can easily transition from 3-protocol based UIP construction to N-protocol based construction.
• Finally, our proposed model is a generalized one that can be utilized by either a web search engine or a recommender system.
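The transition from three protocols to N can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Protocol` type alias, the three stand-in protocol functions, the toy trust and relatedness tables, and the simple additive merge. None of it reproduces the learning and prediction algorithms of the proposed model.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a protocol maps a user id to weighted tags.
Protocol = Callable[[str], Dict[str, float]]

# Toy inputs (hypothetical, not taken from the del.icio.us dataset).
USER_TAGS = {"alice": ["python", "python", "search"], "bob": ["search", "ranking"]}
TRUST = {"alice": {"bob": 0.5}}                  # assumed society trust weights
RELATED = {"python": {"programming": 0.8}}       # assumed tag relatedness scores


def base_protocol(user: str) -> Dict[str, float]:
    """Stand-in for Bp: the user's own tags, weighted by relative frequency."""
    counts = Counter(USER_TAGS.get(user, []))
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def guild_protocol(user: str) -> Dict[str, float]:
    """Stand-in for Gp: tags of related users, scaled by the trust placed in them."""
    scores: Dict[str, float] = {}
    for friend, trust in TRUST.get(user, {}).items():
        for tag, weight in base_protocol(friend).items():
            scores[tag] = scores.get(tag, 0.0) + trust * weight
    return scores


def congregation_protocol(user: str) -> Dict[str, float]:
    """Stand-in for Cp: tags related to the user's own tags."""
    scores: Dict[str, float] = {}
    for tag, weight in base_protocol(user).items():
        for related, strength in RELATED.get(tag, {}).items():
            scores[related] = scores.get(related, 0.0) + strength * weight
    return scores


def build_uip(user: str, protocols: List[Protocol]) -> List[Tuple[str, float]]:
    """Merge the output of any number of protocols into one ranked tag list."""
    merged: Dict[str, float] = {}
    for protocol in protocols:
        for tag, weight in protocol(user).items():
            merged[tag] = merged.get(tag, 0.0) + weight
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


uip = build_uip("alice", [base_protocol, guild_protocol, congregation_protocol])
print([tag for tag, _ in uip])
```

Because `build_uip` only iterates over a list of callables, a fourth or N-th strategy would be added by appending another function with the same signature, without touching the existing ones.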
However, in spite of these advantages, the proposed model has two crucial limitations:
• The first concerns completely new users, for whom no information is available because they have not yet performed any activity on the collaborative tagging platform. Constructing the UIP of such users is not possible until they start performing some tagging, sharing or friend-making activity.
• The second is that our proposed model continuously requires updated collaborative tagging activity records of the users from a tagging platform in order to refresh their UIPs for a better personalized experience. But such a dataset may not always be available due to privacy or security issues. One possible solution is to seek government help for collaboration between search engines and various tagging platforms.
Fig. 8. Comparative analysis of the proposed and baseline methodologies on the basis of average completeness metric.
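The completeness (comp) values in Fig. 8 can be read through the following minimal sketch. It assumes that comp for one user is the fraction of that user's testing-set tags found among the top-N tags of the UIP, and that the result is averaged over users. The cutoff `top_n`, the function names and the toy data are hypothetical and are not taken from the experiments.

```python
def completeness(ranked_tags, target_tags, top_n=10):
    """Fraction of target tags found among the top-N UIP tags, ignoring position."""
    targets = set(target_tags)
    if not targets:
        return 0.0
    top = set(ranked_tags[:top_n])
    return len(targets & top) / len(targets)


def average_completeness(uips, test_sets, top_n=10):
    """Average the per-user completeness over every user that has a UIP."""
    if not uips:
        return 0.0
    total = sum(completeness(uips[u], test_sets.get(u, ()), top_n) for u in uips)
    return total / len(uips)


# Toy data: hypothetical users and tags.
uips = {
    "u1": ["python", "web", "search", "linux"],  # "linux" falls outside the top 3
    "u2": ["music", "travel", "food"],
}
test_sets = {"u1": {"web", "linux"}, "u2": {"food", "music"}}

avg_comp = average_completeness(uips, test_sets, top_n=3)
print(avg_comp)
```

Reordering the tags inside the top-N window leaves the value unchanged, which mirrors the statement that the location of a tag inside the UIP does not affect comp.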
6. Conclusion
This paper addresses the issue of predicting additional information about user interest in order to construct a strong user profile, i.e., UIP. To achieve this objective, rather than utilizing only society or tag relations, both the society relationship network and real-world tag relationships have been utilized. The employment of a single strategy, or no strategy at all, for UIP augmentation cannot lead to a strong and efficient UIP, as confirmed by the results of the experimental evaluations. The current study proposes a protocol based architecture for UIP construction which incorporates both explicitly defined society relationships and real-world tag relationships into user generated information as UIP augmentation strategies. The first protocol has been used to obtain the user’s own tags with appropriate weights, and it serves as the foundation for the remaining protocols. Thus, it is named the Base protocol (Bp). The remaining two protocols, i.e., the Guild protocol (Gp) and the Congregation protocol (Cp), have been used to recommend tags from the society relationships of the user and the group relations of the user’s tags respectively. The proposed model is not limited to these three protocols, but can be extended to n protocols by including more strategies. Certain key differences exist between the proposed model and former methodologies of UIP construction. The first is the exploration of explicitly defined society relations and their quantification by the Gp protocol, as the user’s level of trust varies from one individual to another. The second difference lies in the analysis of latent relationships between pairs of tags, based on the strength of their real-world semantic relatedness, by the Cp protocol. Word vectors generated through the Word2vec model, corresponding to the tags, have been used to calculate the semantic relatedness between a pair of tags. Extensive experiments have been conducted to evaluate the effectiveness of the proposed protocol based architecture model for UIP over various baselines using the del.icio.us dataset. The contribution of the additional information about the user, as predicted by the Gp and Cp protocols, to the construction of the full-fledged final UIP has also been examined by constructing two partial UIPs. According to the results of the various evaluation metrics and the testing of hypotheses, the full-fledged final UIP constructed by the proposed model has outperformed each and every baseline. Even the partial UIPs, i.e., Bp plus Gp and Bp plus Cp, perform better than those constructed by the various baselines. Both Gp and Cp have increased the overall performance of the final UIP to a large extent. This also lends support to our claim of constructing a strong and efficient UIP. As the current research work examines only the specific objectives formulated for the study, there is sufficient scope for further improvement. Every possible improvement discussed here represents a goal of our future research work in the field of search personalization.
(1) Almost every approach studied in the literature, including the proposed one, reflects the long-term user interest without any consideration of the short-term interest of the user. For example, a user who previously preferred spicy food may now be showing interest in boiled vegetables; this issue is known as Information Requirement (IR) drift. The system responsible for constructing the user’s UIP must also take care of dynamic or temporal drift in user interest. But
incorporating a temporal element into the user’s UIP brings different challenges, such as determining whether a change in interest is permanent or seasonal.
(2) In addition to a separate vector for the short-term interest of a user, a UIP construction methodology must also incorporate tag transference to mark the movement of a tag representing a user interest from one level to another if the UIP is constructed in a multi-level fashion, i.e., highly preferable, averagely preferable, occasionally preferable.
(3) Most research studies on UIP construction have assumed that all the annotations made by a user represent the user’s favorite things. However, utilizing each one of them to create a single UIP vector is not reasonable, because the annotations made by a user include not only the user’s preferred items but also some things that annoy the user. Therefore, a separate vector of annoying tags should be constructed, as the performance of a personalized recommender system is highly affected by it.
(4) The plan is to extend the current work further by adopting the sentiment aspect of the tags for cluster formation in the Cp protocol of the proposed model, in order to improve the partial and overall performance.
(5) The effectiveness of a personalization methodology design is no doubt highly dependent on UIP quality, but the role of a resource profile cannot be neglected either. Generally, all the annotations made to a web resource by different users are taken together to create a collective resource profile. But sometimes there are intentional or unintentional annotations which can misrepresent the real purpose of the web resource. So, while creating a resource profile for personalization of web search, these kinds of outliers must be filtered out.
Declaration of Competing Interest
The authors declare that they do not have any financial or nonfinancial conflict of interests.
Credit authorship contribution statement
Shubham Goel: Conceptualization, Methodology, Data curation, Formal analysis, Investigation, Validation, Writing - original draft. Ravinder Kumar: Conceptualization, Visualization, Resources, Writing - original draft, Writing - review & editing.
Acknowledgments
All persons who have made substantial contributions to the work reported in the manuscript (e.g., technical help, writing and editing assistance, general support), but who do not meet the criteria for authorship, are named in the Acknowledgements and have given us their written permission to be named. If we have not included an Acknowledgements, then that indicates that we have not received substantial contributions from non-authors.
Supplementary material
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.eswa.2019.112955.