A hybrid personalized scholarly venue recommender system integrating social network analysis and contextual similarity

Tribikram Pradhan*, Sukomal Pal
Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
Abstract
Rapidly developing academic venues pose a challenge to researchers: identifying the venues that are most appropriate and most relevant to their scholarly interests. Even a high-quality paper is sometimes rejected because of a mismatch between the area of the paper and the scope of the journal to which it is submitted. Recommending appropriate academic venues can therefore help researchers identify and take part in relevant conferences and publish in impactful journals. Although a researcher may know a few leading high-profile venues for her specific field of interest, a venue recommender system becomes particularly helpful when she explores a new field or when more options are needed. We propose DISCOVER: a Diversified yet Integrated Social network analysis and COntextual similarity-based scholarly VEnue Recommender system. Our work provides an integrated framework incorporating social network analysis (including centrality measure calculation), citation and co-citation analysis, topic modeling based contextual similarity, and key-route identification based main path analysis of a bibliographic citation network. The paper also addresses cold-start issues for a new researcher and a new venue, along with a considerable reduction in data sparsity and computational cost and improvements in diversity and stability. Experiments based on the Microsoft Academic Graph (MAG) dataset show that the proposed DISCOVER outperforms state-of-the-art recommendation techniques on standard metrics of precision@k, nDCG@k, accuracy, MRR, macro-averaged F-measure, diversity, stability, and average venue quality.
Keywords: Recommender system, Social network analysis, Citation analysis, Topic modeling, Factorization model, Main path analysis

1. Introduction
Recommender systems suggest different objects to users based on their personal preferences, using various data analysis techniques [1-3]. Academic recommender systems mostly provide recommendations for collaborators [4, 5], papers [6, 7], citations [8-11], and/or academic venues [12-14]. These systems have been useful to academicians, as they objectively provide users with personalized information services [14-16]. Although there has been quite a bit of work on different kinds of academic recommendation, very little work exists in the literature on academic venue recommendation, albeit it started quite early [17].

Researchers, in general, intend to publish in academic venues that acknowledge high-quality papers and to participate in academic conferences or workshops that are relevant to their area of research [18, 19]. Among the various problems that researchers confront, an important task is to identify relevant publication venues. The task is becoming increasingly difficult due to the continuous increase in the number of research areas and the dynamic change in the scope of journals [17, 20]. More collaborations are taking place among disciplines in the research communities, which is leading to reduced compartmentalization at the coarse level but a continuous increase in the number of venues in interdisciplinary areas [17, 21]. For example, the Microsoft Academic Graph (MAG) dataset (http://research.microsoft.com/en-us/projects/mag/), a heterogeneous graph of information relating to a collection of scientific publication records and their relationships within that collection, had 23,404 journals and 1,283 conferences in 2016 [22, 23] and now has 48,668 journals and 4,344 conferences (https://academic.microsoft.com, as of 08.01.2019). As the research horizon expands, researchers find it challenging to remain up to date with new findings, even within their own disciplines [24, 25]. Moreover, with the passage of time, researchers' own interests expand, evolve, or adapt in rapidly changing subject areas, creating a need for information on appropriate venues in the changed scenario [26]. The increase in interdisciplinary research areas also poses great challenges to research institutes and their libraries as they strive to understand the information-seeking behaviors and dynamic information needs of their users [17]. Information specialists need prompt and seamless information on researchers' reading priorities in order to make decisions on
venue subscriptions, instead of relying only on the venues' impact factor or on users' explicit requests.

On the other hand, researchers also need to know about new venues to remain updated. They usually get updates from colleagues, supervisors, friends, the internet, and books, but often the information is not sufficiently comprehensive and/or appropriate for what their research actually demands. Researchers, therefore, sometimes end up approaching inappropriate venues, resulting in rejections, delays in publication, and/or compromises in the quality of the publication. Venue recommendation, for either journals or conferences, has therefore become an important area of research in recent times [27]. Out of the many reasons for its increasing importance, some are given below as emerging scenarios [14].

(a) A researcher from industry has made a breakthrough in her research area. To collaborate with her peers from academia, she may want to find a suitable academic venue (conference) that she is not very aware of.

(b) A junior researcher, i.e., a researcher who is at the initial stage of her research and has no or very few publications, intends to extend her research area, but a lack of knowledge about appropriate academic venues becomes a challenge for her to explore newer areas.

(c) A veteran researcher knows her research area very well, but when she ventures into a new field or works in an interdisciplinary area, she may look for a cross-field venue recommendation.

(d) A journal may merge with some other related journal with modified scope and objectives. Researchers may not be aware of such developments.

In order to recommend a relevant venue of high quality, we need to focus, from the perspective of a researcher's needs and the development of the particular research area in question, on the following issues.

(i) What are the most relevant publication venues for the researcher in question?

(ii) How can a researcher find high-quality venues?

(iii) What are the most suitable conferences/workshops a researcher should participate in, for a given area?

Most of the existing techniques depend on co-authors' past publications and/or ratings of the venues provided by other researchers to perform such a venue recommendation. A few approaches use a random walk model or topic-based similarity for the same purpose. Based on our literature survey, we attempt to explore the following research questions (RQs).

RQ1: To what extent is the existing collaborative filtering based model able to recommend relevant venues from the perspective of researcher needs?

RQ2: Is the existing content-based filtering model able to recommend relevant venues from the perspective of researcher needs?

RQ3: To what extent is the network-based venue recommendation model able to satisfy the requirement of relevant venues in all situations?

RQ4: To what extent do publishing house services like Elsevier Journal Finder and Springer Journal Suggester fulfill researchers' venue recommendation needs?

RQ5: To what extent do the existing approaches handle cold-start problems such as new researchers and new venues, and other issues such as data sparsity, scalability, diversity, and stability?

Based on our study of the above approaches, we observed several issues with the existing state-of-the-art techniques. We propose a Diversified yet Integrated Social network analysis and COntextual similarity-based scholarly VEnue Recommender system (DISCOVER). It is developed taking into account recent advances in social network analysis, including centrality measure calculation, citation and co-citation analysis, contextual similarity, and main path analysis of a bibliographic citation network. Contextual similarity through a hybrid approach of both topic modeling and factorization techniques is proposed. The key contributions of this work are the following.

• To deal with the "cold-start" issues of new researchers and new venues in venue recommendation, an integrated approach of social network analysis, contextual similarity, and citation and co-citation analysis is taken into consideration. DISCOVER works irrespective of researchers' past publication records and co-authorship networks, focusing only on the work at hand. New venues with no citations available are also considered for abstract similarity and are given equal opportunity in the recommendation.

• "Data sparsity" (the number of empty, or zero-value, entries in a given researcher-venue matrix; the lower the sparsity, the more useful the matrix) and "diversity" (the average dissimilarity between all pairs of recommended venues in the result set) are two major issues in academic venue recommender systems. The bibliographic network is a huge graph with relatively few edges, where the number of edges is close to the minimal number of edges, and handling this huge graph with few inter-node connections is a severe challenge. Also, recommender systems often suffer from the problem of suggesting papers only from a single publisher or just a few publishers. We use different techniques such as keyword-based filtering and centrality measures like betweenness, closeness, degree,
eigenvector, and HITS score at various stages, one after another.

In addition to the techniques mentioned above, we normalize the overall score of bibliographic coupling (BC) and co-citation (CC) by a hop-distance d(I, k), the hop-distance from the paper of interest I to paper k in the citation network. This normalized score gives an incentive to papers that are close to the paper of interest (I) and penalizes distant papers, even though they may have a similar combined strength of BC and CC. These techniques, at each stage, substantially reduce the number of candidate papers that are potentially related to a given seed paper. A 90% reduction occurs after the initial step of the keyword-based search, and there are reductions in the other steps down the line as well. Hence there is no sparsity issue in the proposed system. The proposed system also provides a diversified recommendation drawn from different publishers.

• To address the issue of "stability" (the change in ranked recommendations when new papers are introduced), a topic modeling based contextual similarity is incorporated into the proposed system. We develop a content-aware system based on title, keyword, and abstract similarities with a given seed paper. The abstract similarity is computed using LDA, Okapi BM25+, and non-negative matrix factorization (NMF) techniques. A score-based fusion technique, CombMNZ, is then applied to fuse the similarity scores of LDA and NMF in order to maintain the stability of the contextual similarity.

• In venue recommender systems, "scalability" (the ability of a system to accommodate new researchers and/or papers) is one of the major issues. In the proposed system, the most important papers in a citation network are identified through a number of stages, but at the very first step we choose the potential candidates for the subsequent steps very selectively. This keyword-based search strategy is computationally linear in the database size and does not increase the output size even when the input size increases. Therefore, there is no considerable increase in overall computation when the input data grows.

• Extensive experiments were conducted on DISCOVER using the MAG dataset. The proposed system is seen to outperform several other state-of-the-art venue recommendation systems, with substantial improvements in precision, nDCG, accuracy, MRR, and average venue quality (ave-quality).

This paper is organized as follows. We visit the state of the art, including our motivation, in Section 2. We provide a more elaborate problem description in Section 3. In Section 4, we describe the proposed model, including the search strategy and preprocessing efforts. Experimental details, including data description, evaluation metrics, and parameter selection, are discussed in Section 5. Section 6 covers experimental results and discussions, including the limitations of our study. We finally conclude with future directions in Section 7.

2. Background

A recommender system assists a person's decision-making process. It provides her options based on her requirements, especially when immediately available information is inadequate to make an informed decision [3]. Adomavicius and Tuzhilin [2] authored a comprehensive review of recommender systems and suggested mainly three types of recommender systems based on their working principles. In addition, we also include network-based recommendation [14]. We attempt to provide here the necessary background on academic recommender systems according to this taxonomy.

2.1. Collaborative filtering based recommendation (CF)

Collaborative recommender systems (or collaborative filtering systems) predict the utility of items for a user based on the items previously rated by other users who have similar likings or tastes [2]. In the field of academic recommendations, Yang et al. [28] proposed a model to explore the relationship between publication venues and writing styles using three kinds of stylometric features: lexical, syntactic, and structural. In another paper, Yang et al. [29] used a collaborative filtering model incorporating writing style and topic information of papers to recommend venues. Yang et al. [13] proposed another joint multi-relational model (JMRM) of venue recommendation for author-paper pairs. Hyunh et al. [30] proposed a collaborative knowledge model (CKM) to organize collaborative relationships among researchers; the model quantified the collaborative distance and the similarity of actors before recommendation. Yu et al. [31] proposed a prediction model which used collaborative filtering for personalized academic recommendation based on the continuity of a user's browsing content. Liang et al. [1] proposed a probabilistic approach incorporating user exposure, modeled as a latent variable whose value is inferred from data, for collaborative filtering. Alhoori et al. [17] recommended scholarly venues taking into account the researcher's reading behavior, based on personal references and the temporal factor of when references were added.

2.1.1. Problems with CF approaches (RQ1)

Although CF has been quite popular in the last decade for scholarly venue recommendation, most CF approaches suffer from the following drawbacks.
(i) CF approaches are less effective when there are not enough ratings present in the researcher-venue matrix. The recommendations may not be useful for a new researcher who lacks a publication history.

(ii) The techniques are not likely to recommend a new venue or a less popular venue, as such a venue lacks publication statistics. Therefore, some relevant venues may be missed.

(iii) The computational cost is high because an extensive number of articles, venues, and researchers is taken into consideration during processing, and thus scalability is a challenge.

(iv) The researcher-venue matrix that is at the core of these techniques is exceptionally sparse, as most researchers publish and cite few articles and are involved with very few academic venues.

2.2. Content-based recommendation (CBF)

In CBF, users are recommended items similar to the ones they preferred in the past. In the case of academic recommender systems, a user is recommended papers, collaborators, and/or venues similar to those the user liked earlier. Medvet et al. [34] considered the title and abstract of papers to recommend scholarly venues, using n-gram based Cavnar-Trenkle, two-steps-LDA, and LDA+clustering to retrieve the language profile, a subtopic of papers, and the identification of the main topic as a research field. Errami et al. [35] proposed a model called eTBLAST to recommend journals based on abstract similarity, using the z-score of a set of extracted keywords and a weighted "journal score" formula. Schurmie et al. [36] proposed the Journal/Author Name Estimator (Jane, http://jane.biosemantics.org) on the biomedical database MEDLINE to recommend journals based on abstract similarity; they exploited weighted k-nearest neighbors and the Lucene similarity score in order to rank articles. Wang et al. [7] presented a content-based publication recommender system (PRS) for computer science exploiting softmax regression and chi-square based feature selection techniques. Liang et al. [37] presented a novel framework for possible cross-disciplinary collaboration.

Recently, a few online services have started providing support for suggesting journals using keyword, title, and abstract matching. These services include Elsevier Journal Finder (http://journalfinder.elsevier.com) [38], Springer Journal Suggester (http://journalfinder.com), Edanz Journal Selector (https://www.edanzediting.com/journal-selector), and EndNote Manuscript Matcher (http://endnote.com/product-details/manuscript-matcher). Elsevier Journal Finder requires only the title and abstract of a paper and uses noun phrases as features and Okapi BM25+ to recommend journals, but its recommendations are restricted to Elsevier publications only [38].

2.2.1. Problems with CBF approaches (RQ2)

CBF or topic-based models use the author's profile and the content of their papers as well as that of the papers published at a specific venue [39]. Most approaches use LDA for topic modeling and rank venues based on the similarity of venues that published similar papers [40]. Other salient issues with CBF approaches are as follows.

(i) CBF approaches suffer from limited content analysis, which can significantly reduce the quality of recommendation [7, 41]. Most of the time, they require the full text of the paper and thus are not usable at the early stage of paper writing [34]. Usually, the abstract alone is not sufficient to extract the necessary reliable and relevant information.

(ii) New venues are less likely to be recommended, as the models prefer venues with a high number of published papers.

(iii) The models provide poor recommendations to a new researcher who lacks publication records.

(iv) The recommendations are heavily biased towards the researcher's past area of research and are therefore not suitable when one changes her area of interest or works in an interdisciplinary field.

2.3. Hybrid recommendation (HR)

Hybrid approaches combine collaborative and content-based methods, avoiding certain limitations of content-based and collaborative systems. Wang et al. [7] proposed hybrid article recommendations incorporating social tag and friend information. Boukhris et al. [42] suggested a hybrid venue recommendation based on the venues of the co-citers, co-affiliated researchers, and co-authors of the target researcher; it is based on bibliographic data with citation relationships between papers. Minkov et al. [43] introduced a method of recommending future events. Tang et al. [44] introduced a cross-domain topic learning (CTL) model to rank and recommend potential cross-domain collaborators. Xia et al. [45] proposed a socially aware recommendation system for conferences. Similarly, Cohen et al. [46] explored the domain of mining specific context in a social network to recommend collaborators.

2.4. Network-based recommendation (NB)

On top of the above approaches, the approach based on a network representation of the input data has gained considerable attention in the recent past [47], mainly to alleviate the problems of the CF and CBF approaches. Here, a social graph is built among the authors based on co-authorship. An edge exists between two authors if they
co-author at least one paper [40, 48]. The venue having the highest count among the papers within n hops of a given author node is recommended.

Klamma et al. [27] proposed a Social Network Analysis (SNA) based method using collaborative filtering to recognize the most similar researchers and rank obscure events by integrating the ratings of the most similar researchers into the recommendations. Silva et al. [49] proposed a three-dimensional research analytics framework (RAF) incorporating relevance, productivity, and connectivity parameters. Pham et al. [50] used the number of papers of a researcher in a venue to determine her rating for that venue, using clusters on social networks. Later, Pham et al. [51] presented clustering techniques on a social network of researchers to identify communities in order to generate venue recommendations; they also applied traditional CF calculations to provide the suggestions. Chen et al. [20] introduced a model, AVER, to recommend scholarly venues to a target researcher; this approach utilizes a random walk with restart (RWR) model on the co-publication network incorporating author-author and author-venue relations. Later, Yu et al. [14] extended AVER to the personalized academic venue recommendation model PAVE, where the topic distributions of a researcher's publications and venues were utilized in LDA. Luong et al. [52] identified suitable publication venues by investigating the co-authorship network, the most frequent conferences, and the most successive conferences based on a normalized score. Luong et al. [40], in another work, recommended suitable publication venues by investigating authors' co-authorship networks in a similar field. Xia et al. [45] provided venue recommendation using Pearson correlation and characteristic social information of conference participants to enhance smart conference participation.

2.4.1. Problems with NB approaches (RQ3)

A few limitations of these approaches are as follows.

(i) Irrespective of actual content, each paper authored by the same set of authors will receive the same recommendation.

(ii) Recommendations are very poor for a new researcher who does not have any past publication records.

(iii) They cannot recommend a new venue, as the model is based on the publication history of venues.

(iv) Venues with less popularity among the co-authors of a given author are seldom recommended, although content-wise they may be appropriate.

2.4.2. Problems with freely available online services (RQ4)

Elsevier Journal Finder (EJF) and Springer Journal Suggester (SJS) are two popular freely available online journal recommender systems based on CBF. Some limitations of these systems are as follows.

(i) Both systems restrict suggestions to their own publications only (either Elsevier or Springer), which is a severe limitation from the users' point of view. Suggested journals are often found not to match the topic of the paper.

(ii) EJF suggests a maximum of ten journals, whereas SJS provides a maximum of twenty. Sometimes they provide fewer than ten journals, owing to the unavailability of contextually similar papers.

(iii) The ranking algorithm used in both systems only works well if there are enough sample papers (at least 100) in each journal. However, for some new journals there may not be that many published papers, and the existing systems fail to recommend the relevant journals [38].

(iv) There is no provision for the recommendation of high-quality conferences or workshops (in a few domains, e.g., information retrieval and machine learning, conferences/workshops may have high visibility).

2.4.3. Cold-start and other issues (RQ5)

Most of the approaches discussed above use venue rating analysis, content analysis, and the co-authorship network or participation history of researchers to recommend scholarly venues, but they suffer from various cold-start issues for new researchers, new venues, or less popular venues, and also from other issues such as data sparsity, scalability, diversity, and stability. A few limitations of the existing approaches are as follows.

(i) In CF-based venue recommender systems, data sparsity is a major issue that arises due to the sparseness of the researcher-venue rating matrix. They also suffer from the cold-start issue for new researchers.

(ii) CBF performs poorly due to ambiguity in text comparison and also suffers from cold-start issues for new researchers and new venues.

(iii) Two major issues of CBF approaches are limited content analysis and over-specialization, due to which the lack of diversity is a severe problem in this type of approach [3, 7, 41].

(iv) Scalability is also a major challenge in CF and CBF based venue recommender systems, as the procedures therein are not linear in input size.

(v) Most of the time, stability is a severe issue in CF and NB based venue recommender systems.

(vi) The existing techniques have an undue bias against new venues and new researchers coming into the system with fewer publication records [35, 40, 51].

Our proposed solution attempts to address the issues discussed above, as described in the following sections.
3. Problem description

Let G = (V, E) be a citation graph with n papers, such that V = {p_1, p_2, ..., p_n}, and each directed edge e = (p_i, p_j) ∈ E represents a citation from paper p_i to paper p_j. We use the following two phrases to describe the citation network.

(i) References of p_i: the set of papers which are referred to by the paper p_i.

(ii) Citations to p_j: the set of papers which have used the paper p_j as a reference.

For the rest of the paper, we use the above two phrases to define the graph around a vertex p_i. Let each paper p_i be published in a particular venue v_i, so that S = {v_1, v_2, ..., v_n} is a predefined set of publication venues (not all v_i's are necessarily unique). Given an input paper (seed paper) p_0, the venue recommendation task is to recommend an ordered list of suitable publication venues (v_01, v_02, ..., v_0k) related to the seed paper p_0, such that v_01 is the most relevant and v_0k is the k-th most relevant venue, in decreasing order of relevance or suitability. Hence it is primarily a ranking problem. We first need to figure out the set of papers which are closely related to the seed paper and then rank them. Venue recommendations are provided when the title, keywords, and abstract of a seed paper are given to the system as input (Fig. 1).

Figure 1: Overview of DISCOVER (input: the title, keywords, and abstract of a seed paper together with the bibliographic dataset; output: the top-N recommended venues).
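Purely as an illustration of the task's input and output (and not part of the paper's implementation), the interface could be sketched in Python as follows; the names SeedPaper and recommend_venues are hypothetical.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SeedPaper:
    title: str
    keywords: List[str]
    abstract: str

def recommend_venues(seed: SeedPaper, k: int = 10) -> List[Tuple[str, float]]:
    """Return an ordered list [(v_01, score), ..., (v_0k, score)] of the k most
    suitable venues for the seed paper, in decreasing order of relevance."""
    raise NotImplementedError  # realized by the stages described in Section 4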
4. The functional architecture of DISCOVER

We introduce the overall architecture of the proposed system, DISCOVER, along with its operational methods. DISCOVER is designed to shortlist academic venues in order to make personalized recommendations for researchers. It has a layered approach where each layer performs a specific task whose output is used by the next layer.

4.1. Framework of DISCOVER

DISCOVER is based on social network analysis, where the association of nodes in networks and the significance of individual nodes are considered. The overall process comprises the following six steps (Fig. 2).

Figure 2: Organizational architecture of DISCOVER (data preprocessing; title and keyword matching; a split into papers with citations (Set I) and without citations (Set II); social network analysis; citation analysis; main path analysis; abstract matching; and result fusion leading to the top-N venue recommendation).

A. Data Preprocessing: This step aims to structure, arrange, and organize the dataset so that relevant papers can be extracted faster.

B. Content Analysis (field of study, keyword, title, and abstract matching): This module is introduced to filter relevant papers based on field-of-study, keyword, title, and/or abstract matching. This step may be utilized multiple times, at different points in time, when one or more types of matching are done.

C. Social Network Analysis: Various centrality measures such as degree, closeness, betweenness, eigenvector, and HITS score are calculated for the papers shortlisted in the previous Steps A and B; they are used down the line.

D. Citation Analysis: This module is used to accomplish two objectives. First, identification of the
most similar papers as papers of interest (I) with the help of title, keyword, and abstract similarity. Then, by applying bibliographic coupling (BC), co-citation scores (CC), and a new distance measure, the papers most related to the paper of interest (I) are selected.

E. Main Path Analysis: To determine the most influential papers in the citation network, a traversal count, the search path count (SPC), is used. Key-route search is employed to select significant links during both local and global search in order to identify the global key-routes.

F. Result Fusion: The final ranking of scholarly venues is done based on abstract similarity using LDA and NMF. A score-based fusion technique (CombMNZ) is applied to leverage the advantages of both methods (a small illustrative sketch follows this list).
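As a small illustration of the kind of score-based fusion used in Step F, the sketch below combines two ranked score lists with CombMNZ (sum of normalized scores multiplied by the number of lists in which an item appears). It is a generic sketch assuming min-max normalized scores, not the DISCOVER implementation itself.

def combmnz(score_lists):
    """score_lists: list of dicts mapping venue -> similarity score (e.g., from LDA and NMF)."""
    normalized = []
    for scores in score_lists:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        normalized.append({v: (s - lo) / span for v, s in scores.items()})
    fused = {}
    for venue in {v for scores in normalized for v in scores}:
        hits = [scores[venue] for scores in normalized if venue in scores]
        fused[venue] = sum(hits) * len(hits)  # CombSUM times the number of lists containing the venue
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example usage: ranking = combmnz([lda_scores, nmf_scores])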
The details of the above steps are described below.

4.2. A. Data preprocessing

The original MAG dataset is organized in a hierarchical fashion divided into four levels (Level 0 being the root and Level 3 the leaves), where the levels correspond to fields of study. Levels are related in a superclass-subclass relation, with lower levels subsumed in upper levels. However, a paper belonging to a Level-3 node can be a part of multiple Level-2 nodes (in the case of interdisciplinary fields) and, following the same logic, of multiple Level-1 nodes. Hence, locating the field of study to get all the relevant papers using only keywords is not straightforward; it can often be very tedious and time-intensive. The dataset, therefore, is reorganized using a hybrid binary tree. Fig. 3 illustrates the modifications.

We use the relations between the fields of study (FOS) in the original graph. FOSs related to each other are provided with a pair-wise confidence score based on their similarity. A score of 1 implies that the two fields are very similar (part of, or dependent on, each other), and a lower score implies lesser similarity; if two fields are not similar at all, their confidence score is 0. We use the confidence scores among FOSs to divide the children of each node into two groups: those whose score with the parent is greater than the average confidence score of all its children (left children), and those whose score is equal to or less than the average (right children). For example, the field of study F22, which is at Level 1, has three children: F13, F14, and F15. These children are divided into two groups:

(i) Fields of study F13 and F14 have a high confidence score with the field of study F22 and are placed on the left-hand side of F22.

(ii) The field of study F15 has a lower confidence score with the field of study F22 and is placed on the right-hand side of F22.

Figure 3: Hybrid binary tree over the hierarchy of fields of study (Levels 0-3, with keyword sets S1-S14 stored between levels).
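A minimal sketch of this confidence-score based grouping is shown below; the data structure (children represented as a dict of FOS id to confidence score) is an assumption made for illustration and is not the authors' implementation.

def split_children(children_confidence):
    """children_confidence: dict mapping child FOS id -> confidence score with the parent FOS.
    Returns (left_children, right_children) per the hybrid binary tree rule."""
    avg = sum(children_confidence.values()) / len(children_confidence)
    left = [fos for fos, c in children_confidence.items() if c > avg]
    right = [fos for fos, c in children_confidence.items() if c <= avg]
    return left, right

# Example from the text: F22 with children F13, F14 (high confidence) and F15 (low confidence)
# split_children({"F13": 0.9, "F14": 0.8, "F15": 0.2}) -> (["F13", "F14"], ["F15"])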
4.2.1. Keyword-set construction and organization

Keywords are identified as available under the keywords tag of research papers. Stopwords, if any, are removed, and keywords are stemmed using the Snowball stemmer [53]. The keywords of all the papers are fetched in a bottom-up fashion, and their union is stored between two levels of FOS (the rectangular boxes in Fig. 3). For example, the keyword set between Level 3 and Level 2 (e.g., S1) is constructed by concatenating the keywords of the fields of study at Level 3 (F1, F2, and F3). Similarly, the keyword set between Level 2 and Level 1 (S7) is constructed by concatenating the keywords from papers in the Level-2 FOSs (F13 and F14).

4.3. B. Content Analysis (keyword-based search strategy)

We traverse the above tree in a top-down fashion to search for papers, extracting only those papers having high similarity with the keywords of a given seed paper. A queue Z is created and maintained to keep track of the visited nodes. At the start, Z contains only the root nodes corresponding to the fields of study (e.g., F25). A node is popped from Z, and the given set of keywords is matched separately with the left, right, and parent keyword sets of the popped node. Upon a match, the number of matches is checked, and we proceed in the following way.

(i) Case 1: If the number of matches is greater on any one side (either left or right), the other side is ignored. All the nodes of the greater side are added to the queue Z.

(ii) Case 2: If the numbers of matches on both sides are equal, nodes from both sides are added to the queue Z.
(iii) Case 3: If the number of matches of the parent is equal to that of the greater side, or all three are equal, the parent node is also added to the queue Z.

This process is repeated until we reach the leaf level (Level 3). There is duplication of data in the proposed hybrid binary tree, but the computation time is enormously reduced because at every step, just as in binary search trees, the unmatched half of the tree is not considered.

4.3.1. Illustrative example of keyword-based search

Suppose we are matching the keywords for a given paper p_m. Initially, we have only one field of study, F25, in the queue Z. This node is popped, and the keywords of p_m are matched with the keyword sets at Level 0, namely S13 and S14, and with the parent field keywords (F25). Let the left subtree have a clear maximum number of matches. We then proceed in that direction and successively push F22 and F23 into the queue Z. For each of these nodes, the search is done similarly to that for F25. Finally, when we reach Level 3, we have all the relevant field-of-study ids in our queue, including the primitive lower-level keywords and some higher-level keywords matching the given keywords. Suppose at Level 3 only the fields of study F1, F2, F3, and F14 are left in the queue. In Fig. 3, all papers belonging to fields F1, F2, F3, and F14 are then fetched and used for further analysis. For the papers so selected, the following procedures are adopted.
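To make the traversal concrete, here is a minimal sketch of the queue-based search under an assumed node structure (each node holding its own keyword set plus left/right lists of children); it is one illustrative reading of the rules above (in particular of Case 3), not the authors' code.

from collections import deque

def keyword_search(root, seed_keywords):
    """root: tree node with attributes .keywords (set), .left, .right (lists of child nodes).
    Returns the set of matched field-of-study nodes."""
    seed = set(seed_keywords)
    matched = set()
    Z = deque([root])
    while Z:
        node = Z.popleft()
        if not node.left and not node.right:            # leaf: a Level-3 field of study
            matched.add(node)
            continue
        left_hits = sum(len(seed & child.keywords) for child in node.left)
        right_hits = sum(len(seed & child.keywords) for child in node.right)
        parent_hits = len(seed & node.keywords)
        if left_hits > right_hits:                       # Case 1: follow only the stronger side
            Z.extend(node.left)
        elif right_hits > left_hits:
            Z.extend(node.right)
        else:                                            # Case 2: tie, follow both sides
            Z.extend(node.left)
            Z.extend(node.right)
        if parent_hits >= max(left_hits, right_hits):    # Case 3: parent matches as strongly,
            matched.add(node)                            # keep the parent field as relevant too
    return matched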
4.4. C. Social Network Analysis (calculation of different centrality measures)

Depending on the availability of citations, the shortlisted papers are divided as follows (Fig. 2):

(a) papers whose citations exist in the dataset (Set I);

(b) papers whose citations are not available in the dataset (Set II).

The system generates a citation network only with the Set-I papers, based on references. The following centrality measures are used on them to determine their importance [54, 55].

Figure 4: Graph model for centrality measures (betweenness, closeness, degree, eigenvector, and HITS centralities are computed on the citation network, and the union of papers whose score on any measure is at least the average score is retained as a set of unique papers).

4.4.1. Betweenness centrality (C_B)

The betweenness centrality of a node quantifies how frequently the node appears on the possible shortest paths between any two given nodes. The betweenness centrality of a paper q is defined as

C_B(q) = \sum_{p, k \in V,\; p \neq k \neq q} \frac{\sigma_{pk}(q)}{\sigma_{pk}}    (1)

where \sigma_{pk} denotes the number of shortest paths from p to k, and \sigma_{pk}(q) denotes the number of shortest paths from p to k via q. Nodes with high betweenness act as potential deal makers [56].

4.4.2. Degree centrality (C_D)

In a graph, the degree of a node is the number of edges adjacent to that node [57]. The higher the number of neighbors of a given node, the higher its impact. The degree centrality of a paper p is defined as

C_D(p) = indeg(p) + outdeg(p)    (2)

where indeg(p) is the number of papers citing paper p and outdeg(p) is the number of papers that p refers to. For each paper p, the in-degree is computed first, and the papers whose in-degree is greater than or equal to the average in-degree of the network are shortlisted for further computation. Then the average of the degree scores {indeg(p) + outdeg(p)} is used as the threshold for removing papers. We adopt such two-stage filtering in order to ensure that (i) highly cited papers are not missed and (ii) new papers which cite many papers are not missed either.

4.4.3. Closeness centrality (C_C)

This metric attempts to capture how centrally a node is located with respect to the other nodes, and it is measured as the inverse of the total pair-wise distance from the node to all other nodes. The closeness centrality of a node p is defined as

C_C(p) = \frac{1}{\sum_{q \neq p} d_G(p, q)}    (3)

where d_G(p, q) denotes the distance between vertices p and q, i.e., the minimum length of any path connecting p and q in G.

4.4.4. Eigenvector centrality (C_E)

It denotes the importance of a given node in a network based on the node's connections. A node is central to the extent that it is associated with others who are central. It relies upon the quantity and quality of
the neighbor nodes that are directly associated with the paper [58]. Eigenvector centrality measures not just how many papers are linked with a paper, but also how many important papers are connected with it. The eigenvector centrality of a node p is

C_E(p) = \frac{1}{\lambda} \sum_{q \in B_p} a_{p,q} \, x_q    (4)

where a_{pq} is the (p, q)-th element of the adjacency matrix A of papers, with

a_{pq} = 1 if q is linked to p, and a_{pq} = 0 otherwise    (5)

x_q is the eigenvector centrality score of q, and \lambda is the eigenvalue. The measure captures the influence of the set B_p consisting of all papers connected to paper p.

4.4.5. Hyperlink-induced topic search - HITS (C_H)

HITS is a link analysis algorithm based on the hub and authority concept. A good hub represents a paper that points to many other papers, and a good authority represents a paper that is linked by many different hubs [59]. Given the authority and hub weights u^{(p)} and v^{(p)} of a node p, the operation to update the u-weights is

u^{(p)} \leftarrow \sum_{q:(q,p) \in E} v^{(q)}    (6)

Similarly, the operation to update the v-weights is

v^{(p)} \leftarrow \sum_{q:(p,q) \in E} u^{(q)}    (7)

The set of weights u^{(p)} is represented as a vector U with a coordinate for each page in G; similarly, the set of weights v^{(p)} is represented as a vector V.

4.4.6. Motivation for selecting various centrality measures

The different centrality measures are summarized in Table 1. The average score of each measure is used as a threshold to shortlist papers in parallel, and the shortlists are combined to retain only unique papers (Fig. 4). We choose all of them individually because the aim of filtering is first to remove the unimportant papers before selecting the important ones. For example, if a very high-quality paper has a low in-degree because of its recent publication, the paper may not be retained by the degree centrality calculation, but it gets due consideration in the betweenness, closeness, eigenvector, and HITS centrality calculations and may therefore qualify through these measures (Fig. 4). In this way, if a paper lacks one or more factors in its citation profile, it can still qualify through other centrality measures, giving a fair chance to all potential papers. Moreover, this exercise is restricted only to the Set-I papers, which have a good number of citations; the task, therefore, does not punish papers which do not have enough citations (Set II).

A smaller citation network is generated considering only the shortlisted papers, and the connected components (mathematically, weakly connected components, since directed edges are treated as undirected for finding the components; henceforth often referred to simply as components) are identified. We remove the components having fewer than the average number of nodes and retain the rest, where

Average #nodes = (Total #nodes in the citation network) / (Total #connected components)    (8)

This step further reduces the number of potentially non-relevant papers.

Table 1: Interpretation of centrality measures used in the citation network [60]

Centrality measure | Meaning | Interpretation in citation networks
Degree | Node with most connections | How many papers can this article reach directly?
Betweenness | Connects disconnected groups | How likely is this paper to be the most direct route between two papers in the citation network?
Closeness | Rapid access to related paper nodes | How fast can this paper reach every paper in the citation network?
Eigenvector | Connections to high-scoring nodes | How well is this paper connected to other well-connected papers?
HITS | Directed weighted degree centrality | How is the content of the paper and the value of its links to other papers?

4.4.7. Complexity analysis

Although our academic bibliographic data is huge, containing a large number of nodes (n) (Table 16), the graph is actually sparse, as the number of edges (m) is much less than O(n^2). In this paper (implementation with the MAG dataset), we found around 5k-13k papers (Computer Science) after the keyword-based search strategy (Sec. 4.3), but most of these papers were without any citations. For 72 different seed papers, the average number of papers found with citations (Set I) was 4,373, and the average number of edges was around 7,496. The average degree of a node was 1.71.

In social network analysis, the degree centrality measure has a time complexity of O(m). Computing the closeness and betweenness centralities of all vertices in the citation network involves the shortest paths between all pairs of vertices, which takes O(mn) time using Brandes' algorithm [61]; this algorithm performs a simple breadth-first search in which distances and shortest-path counts are determined from each vertex. For computing the eigenvector and HITS centrality measures, the power iteration method can be used, which approximates the metrics within a few steps and also avoids numerical accuracy issues; that way, both the eigenvector and HITS centrality measures can be computed in O(n^2) time [62, 63]. Hence, the overall computational complexity of the social network analysis is not more than O(n^2).
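The centrality-based shortlisting of Section 4.4 can be sketched with networkx as below. This is an illustrative reimplementation under the stated thresholding rule (keep any paper whose score on at least one measure reaches that measure's average), not the authors' code; eigenvector centrality is computed with the numpy variant to avoid convergence issues.

import networkx as nx

def shortlist_by_centrality(citations):
    """citations: iterable of (citing_paper, cited_paper) pairs for Set-I papers."""
    G = nx.DiGraph(citations)
    measures = {
        "degree": nx.degree_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "eigenvector": nx.eigenvector_centrality_numpy(G),
        "hits_authority": nx.hits(G)[1],
    }
    keep = set()
    for scores in measures.values():
        avg = sum(scores.values()) / len(scores)
        keep.update(p for p, s in scores.items() if s >= avg)   # union over all measures
    # Reduced network; drop weakly connected components smaller than the average size (Eqn. 8)
    H = G.subgraph(keep).copy()
    comps = list(nx.weakly_connected_components(H))
    avg_size = H.number_of_nodes() / len(comps)
    big_components = [c for c in comps if len(c) >= avg_size]
    return H, big_components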
4.5. B. Content Analysis

From each shortlisted connected component, the title and keyword similarity of all papers with the seed paper (input paper) is re-considered. For title similarity, the Python nltk WordNet interface is utilized [64] (an open-source package trained on the English WordNet), as given in Algorithm 1 and Algorithm 2. We employ the Wu-Palmer similarity (Wup_similarity) to compute the similarity between synsets [65]. Synsets are organized in WordNet taxonomies (the hypernym tree) in such a way that the root or higher-level terms are more abstract (hypernyms) and lower-level terms are more specific (hyponyms). The similarity is mainly calculated by considering the depths of the two synsets and that of their least common subsumer, i.e., their most specific ancestor node (see www.nltk.org/howto/wordnet.html):

Wup_similarity(s_1, s_2) = \frac{2 \times depth(lcs(s_1, s_2))}{depth(s_1) + depth(s_2)}    (9)

where 0 < Wup_similarity(s_1, s_2) \leq 1 and lcs stands for least common subsumer. The score can never be 0, as the depth of the lcs is never 0 (the depth of the root of the taxonomy is 1). Whenever multiple candidates for the lcs exist, the one having the longest path to the root is selected during the calculation.

Algorithm 1: Similarity score generation
Input: S_t = the seed title to be compared with; C_t = list of titles to be checked for similarity
Output: list of similarity scores (S_t, C_t)
Function Create_Synset(S):
    Synset <- {}
    foreach word w in S do
        POS_w <- part-of-speech tag of w
        Synset <- Synset ∪ wordnet.synsets(w, POS_w)[0]   /* adds only the first synonym for w */
    end
    return Synset
End Function
Synset_S <- Create_Synset(S_t)                            /* synset of S_t */
foreach title t_j in C_t do
    Synset_j <- Create_Synset(t_j)                        /* for all t_j in C_t */
    Scores_j <- Sim(Synset_S, Synset_j)                   /* Algorithm 2 */
end
return Scores = [Scores_j]

Algorithm 2: Synset similarity algorithm
Input: Synset_S = array of synset terms S; Synset_j = array of synset terms j
Output: similarity score (Synset_S, Synset_j)
Initialization: Score <- 0, Word_count <- 0
Function Sim(Synset_S, Synset_j):
    for each word a_i in Synset_S do
        Best_score <- 0                                   /* best score for each word in Synset_S */
        for each word b_i in Synset_j do
            if Wup_sim(a_i, b_i) > Best_score then
                Best_score <- Wup_sim(a_i, b_i)           /* Wup_sim as per Eqn. 9 */
            end
        end
        if Best_score != 0 then
            Score <- Score + Best_score                   /* sum of all best scores */
            Word_count <- Word_count + 1                  /* number of words in Synset_S */
        end
    end
    Sim_score <- Score / Word_count
    return Sim_score
End Function

The Jaccard similarity coefficient is employed for keyword similarity (Eqn. 10). If P and Q are the sets of keywords extracted from the seed paper and from a test paper respectively, it is defined as

J(P, Q) = \frac{|P \cap Q|}{|P \cup Q|}    (10)

where 0 \leq J(P, Q) \leq 1. We then find the cumulative similarity score as the average of the two similarities for each paper in the connected components:

Cumulative similarity = \frac{Title similarity + Keyword similarity}{2}    (11)

These cumulative similarity scores are computed for both Set-I and Set-II papers. However, the following steps are applied to Set-I papers only; Set-II papers rejoin at the end of the Main Path Analysis (E).
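A small Python sketch of the title and keyword similarity computation described above is given below (using nltk's WordNet interface for Wu-Palmer similarity and plain set operations for Jaccard); the variable names and the simple whitespace tokenization are illustrative assumptions, not the published implementation.

from nltk.corpus import wordnet as wn   # requires nltk's WordNet corpus: nltk.download('wordnet')

def title_similarity(seed_title, other_title):
    """Average of the best Wu-Palmer similarities, in the spirit of Algorithms 1-2."""
    seed_syns = [wn.synsets(w)[0] for w in seed_title.lower().split() if wn.synsets(w)]
    other_syns = [wn.synsets(w)[0] for w in other_title.lower().split() if wn.synsets(w)]
    best_scores = []
    for s in seed_syns:
        best = max((s.wup_similarity(o) or 0.0) for o in other_syns) if other_syns else 0.0
        if best > 0:
            best_scores.append(best)
    return sum(best_scores) / len(best_scores) if best_scores else 0.0

def keyword_similarity(seed_keywords, other_keywords):
    P, Q = set(seed_keywords), set(other_keywords)
    return len(P & Q) / len(P | Q) if P | Q else 0.0      # Jaccard, Eqn. (10)

def cumulative_similarity(seed, paper):
    # Eqn. (11): average of title and keyword similarity
    return (title_similarity(seed["title"], paper["title"])
            + keyword_similarity(seed["keywords"], paper["keywords"])) / 2.0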
4.5.1. Identification of papers of interest (I)

The cumulative similarity scores (Eqn. 11) are used to identify the top-k (we take k = 10) papers from each connected component that are most similar to the seed paper. The abstracts of these top-k papers are extracted, and abstract similarity with the seed paper is calculated using Okapi BM25+ (BM stands for Best Matching; Okapi BM25+, a variant of Okapi BM25, is employed to address the deficiency of Okapi BM25 in its term-frequency normalization component). Okapi BM25+ is based on the probabilistic retrieval framework [67, 68], whose weighting-based similarity score can be expressed as follows:

Abstract_similarity(P_sd, P_ts) = \sum_{t \in P_sd \cap P_ts} \ln\!\left(\frac{P - wf + 0.5}{wf + 0.5}\right) \cdot \left[\frac{(n_1 + 1) \cdot cf}{n_1\left(1 - r + r \frac{wl}{avwl}\right) + cf} + \delta\right] \cdot \frac{(n_3 + 1) \cdot qcf}{n_3 + qcf}    (12)

where cf is term t's frequency in the testing paper (P_ts), qcf is the term's frequency in the seed paper (P_sd), P is the total number of papers identified from each component, wf is the number of testing papers that contain the term t, wl is the length of the abstract (in bytes), avwl is the average abstract length of the papers in each component, n_1 lies between 1.0 and 2.0, r is usually 0.75, n_3 lies between 0 and 1000, and \delta is a constant (usually 1.0). The paper having the highest BM25+ score with the seed paper is chosen as the paper of interest (I) for each selected component.
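For illustration, the following sketch scores abstracts with a BM25+-style function following Eqn. (12); the parameter names mirror the equation (n1, r, n3, delta), and the simple whitespace tokenization and default parameter values are assumptions made for readability rather than the preprocessing and tuning actually used.

import math
from collections import Counter

def bm25_plus(seed_abstract, test_abstracts, n1=1.2, r=0.75, n3=8, delta=1.0):
    """Return a BM25+-style similarity of each test abstract to the seed abstract (Eqn. 12)."""
    docs = [a.lower().split() for a in test_abstracts]
    seed = Counter(seed_abstract.lower().split())
    P = len(docs)
    avwl = sum(len(d) for d in docs) / P
    df = Counter()                                    # wf: number of test papers containing term t
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(seed) & set(tf):
            idf = math.log((P - df[t] + 0.5) / (df[t] + 0.5))
            tf_part = (n1 + 1) * tf[t] / (n1 * (1 - r + r * len(d) / avwl) + tf[t]) + delta
            qtf_part = (n3 + 1) * seed[t] / (n3 + seed[t])
            score += idf * tf_part * qtf_part
        scores.append(score)
    return scores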
4.6. D. Citation Analysis

The papers so selected provide the basis for a further study of the interplay of the papers within a component, based on co-citation analysis. We look at the bibliographic coupling (BC) and co-citation (CC) scores for each candidate paper [69, 70].

Figure 6: The structure of citation analysis (papers C and D both cite papers P, Q, and R; papers A and B are both cited by P, Q, and R).

4.6.1. Bibliographic Coupling (BC)

Bibliographic coupling gives a measure of the similarity between two papers based on the number of common papers they jointly cite:

BC(C, D) = |L_C \cap L_D|    (13)

where L_C and L_D are the sets of bibliographic references of C and D, respectively. Fig. 6 illustrates bibliographic coupling, showing that papers P, Q, and R are cited by both papers C and D; the BC strength of papers C and D is, hence, 3.

4.6.2. Co-Citation (CC)

Co-citation denotes the number of other papers that cite two given papers together. The co-citation strength (CC-strength) can be computed as

CC(A, B) = |I_A \cap I_B|    (14)

where I_A denotes the set of papers that cite paper A; in other words, |I_A| = indeg(A). In Fig. 6, CC(A, B) = 3, as papers A and B are co-cited by papers P, Q, and R.

4.6.3. Candidate score computation (C-score)

While the BC score implies similarity of two papers in terms of sub-domain and time (they refer to the same set of papers), the CC score implies shared authority on a particular sub-domain or a very general domain (the same set of papers jointly refer to the two). Taken together, they represent the importance and contemporariness of a pair of papers within a citation network. The summation of BC and CC of a particular paper paired with all others in the component can, therefore, be an important feature, and we combine them into a single measure called the candidate score (C-score), which is defined by
C-score(k) = \frac{\sum_{o \in S_I} [BC(k, o) + CC(k, o)]}{d(I, k)}    (15)

where S_I is the collection of nodes in the component containing the paper of interest I, and d(I, k) is the hop-distance from I to k. The reason for using such a hop-distance along with both BC and CC lies in the state-of-the-art literature [71]. All papers other than paper k are represented as o; the C-score considers the relevance of k with o and I. We normalize the score so obtained by d(I, k) in order to give an incentive to the papers that are close to I and to penalize the papers at a distance. The overall process of candidate paper selection is depicted in Fig. 5.

Figure 5: Graph model for candidate score computation (bibliographic coupling, co-citation, and the distance from the paper of interest are combined into a candidate score, and the top-scoring papers are selected as candidate papers).
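A compact sketch of the BC, CC, and C-score computations (Eqns. 13-15) over a networkx citation graph is shown below; it is an illustrative reading of the definitions rather than the authors' implementation, and it assumes the hop-distance is measured on the undirected version of the network.

import networkx as nx

def c_scores(G, interest):
    """G: nx.DiGraph where an edge (a, b) means paper a cites paper b.
    interest: the paper of interest I. Returns {paper: C-score} within I's component."""
    refs = {p: set(G.successors(p)) for p in G}       # papers each paper cites
    cites = {p: set(G.predecessors(p)) for p in G}    # papers citing each paper
    UG = G.to_undirected()
    component = nx.node_connected_component(UG, interest)
    dist = nx.single_source_shortest_path_length(UG, interest)
    scores = {}
    for k in component:
        if k == interest:
            continue
        total = sum(len(refs[k] & refs[o]) + len(cites[k] & cites[o])   # BC + CC with every other o
                    for o in component if o != k)
        scores[k] = total / dist[k]                   # normalize by hop-distance d(I, k), Eqn. (15)
    return scores

# Candidate papers are then those with C-score at least the average C-score:
# avg = sum(scores.values()) / len(scores); candidates = [p for p, s in scores.items() if s >= avg]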
4.6.4. Illustrative example of candidate paper selection

Fig. 7 shows an illustrative example with representative citation scores in Table 2 (Fig. 7 is restructured from Son et al. [71]). P6 has out-degree 2, meaning P6 can have at most two pairs with positive BC strength: BC(P6, P1) = 1 = BC(P6, P7), and BC for all other pairs involving P6 is zero. Again, P6 has in-degree 4, meaning it can have at most four pairs with positive CC values: CC(P6, P9) = 1 = CC(P6, P5) and CC(P6, I) = 4. The aggregate BC and CC score is 8, which is the numerator of the C-score for P6; the denominator is d(I, P6) = 2. Similarly, the BC and CC values of all other nodes are computed; some example nodes are given in Table 2.

Figure 7: Seven-level citation network around the paper of interest I (papers P1-P27).

Table 2: Computation of C-scores for papers in the citation network

Paper | Total BC | Total CC | Total Similarity | d(I,k) | C-score
P6 | 2 | 6 | 8 | 2 | 4.0
P15 | 0 | 1 | 1 | 2 | 0.3
P17 | 3 | 1 | 4 | 2 | 2
P20 | 1 | 2 | 3 | 3 | 1
P22 | 2 | 6 | 8 | 3 | 2.6

Although the total similarity scores (the sum of BC and CC) of papers P6 and P22 are equal, the C-score of P22 is less than that of P6 because P22 is farther from I than P6 is; we propose that P6 is more similar to the user topic than P22. On the other hand, although P6 and P17 are at the same distance, P17 has a lower C-score because the total similarity of P17 is lower than that of P6. Again, the BC of P17 is greater than the BC of P22, but the C-score of P22 is higher than that of P17. Papers with low C-scores tend to be isolated from the network community; P15 is likely an irrelevant paper, probably produced by self-citations and ceremonial citations (a ceremonial citation is one made even though the citing paper is only lightly related to the cited publication).

C-scores are thus computed for each of the papers with respect to an I in the component. Based on a user-specified threshold, the papers with higher C-scores are selected as candidate papers, which are used to generate a citation network. In this work, experiments were conducted for 72 seed papers in the computer science domain and 48 seed papers in the biology domain. We observed that choosing the average C-score as the threshold fit the bill, and the average number of papers left after this step was between 800 and 2000 (Computer Science). In the above example, there are initially a total of 27 papers, excluding I, in the citation network depicted in Fig. 7; taking the average C-score as the threshold, as shown in Fig. 8, only 14 papers are shortlisted as candidate papers.

4.6.5. Complexity analysis

Let us assume there are n_1 vertices and m_1 edges in the citation network generated among the papers shortlisted after title and keywords matching as defined in Sec. 4.5 (only for papers shortlisted after
social network analysis). The citation network has a small diameter d, a phenomenon known as "six degrees of separation" [72-74]. In this paper, the average numbers of nodes and edges were found to be around 1,190 and 1,837 respectively; the average degree of a node was 1.54 and the average diameter was 8.3.

Citation analysis mainly involves two kinds of computation in a citation network: bibliographic coupling (BC) and co-citation (CC). Let us assume the maximum out-degree and in-degree of a given vertex are k_1 and k_2 respectively. To identify the BC of a given node p_i, we need to visit each outgoing neighbor of p_i, and for each of these we need to visit only those vertices whose outgoing edges point to an outgoing neighbor of p_i. As a result, the total time complexity for computing BC for a single vertex is O(k_1 k_2), so for all the nodes in the citation network the complexity is O(n_1 k_1 k_2). Similarly, for the CC computation of a particular vertex p_i, we first visit all vertices whose incoming edges are directly connected to p_i and then, for each of these, traverse all its outgoing neighbors; this again takes O(k_2 k_1) per vertex and O(n_1 k_1 k_2) for all nodes. The total time needed to compute the distance of each vertex from a given vertex p_i is O(m_1 + n_1). Hence, the overall computational complexity of the citation analysis is not more than O(m_1 + n_1).

Figure 8: Identification of papers with higher than average C-scores (candidate papers C1-C14 around the paper of interest I).

4.7. E. Main Path Analysis

The most significant path in a citation network is traced by main path analysis. This is used to identify the structural backbone in the evolution of a scientific field [75]. Main path analysis is most useful when there is a need to investigate connectedness in acyclic networks, and it is especially attractive when nodes are time-dependent, as it chooses the most representative nodes at different points of time [76]. To determine the main path, the following steps need to be considered:

(i) assignment of link weights in the network using a traversal count;

(ii) identification of the key route in the reference-flow graph.

The citation network produced after shortlisting papers by C-score is used for main path analysis. Here we try to arrange the papers in chronological fashion so that knowledge flows from a source paper to a sink paper. A source is an original paper that introduces a new domain (loosely, it can be considered the origin of knowledge), whereas a sink is the most recent paper, which cites its ancestor papers. To build this directed graph (we call it the reference-flow graph), we need to reverse the direction of the citation network. For example, if there is a citation from node B → A in the citation network, a directed link B ← A is drawn in the reference-flow graph; it means paper A is cited by paper B. After applying these changes to the citation network in Fig. 8, the new reference-flow graph is obtained, as given in Fig. 9.

Figure 9: Selection of candidate papers (the reference-flow graph obtained by reversing the citation links among the candidate papers of Fig. 8).

4.7.1. Assignment of link weights in the citation network

For identifying the main path in any network, the links in the network are assigned weights using a traversal count [77] that measures the importance of a link. The number of traversals of a link over different source-sink pairs of nodes is known as the traversal count of the link. It has several variants, depending on how the pairs are chosen.
We use the Search Path Count (SPC) for weighting the links: a link's SPC is the number of times it is traversed if one runs through all possible paths from all sources to all sinks. If a citation link lies on paths through which much knowledge flows, it has a certain prominence in the knowledge-dissemination process. Using SPC as the traversal count in Fig. 9, we obtain the link weights of the citation network as given in Fig. 10. The most significant links are then joined to form the main path connecting a source and a sink node [76], as described below.

Figure 10: Assignment of link-weights using SPC technique

4.7.2. Identification of key route in citation network
There can be several paths between a source-sink pair with the same traversal count, and we need to select the most promising one. Local, global, and key-route search are among the approaches for doing so. We perform the key-route search in the following way.
(i) Choose the link having the maximal traversal count; this link is the starting link of the key-route.
(ii) Push ahead from the end node of the key-route until a sink node is reached.
(iii) Go in reverse from the begin node of the key-route until a source node is reached.
By executing these steps repeatedly, multiple key-routes can be found, each time choosing the link with the next-highest traversal count. The first such key-route, with the highest SPC, is considered the main path.
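A minimal sketch of SPC weighting and the key-route search on a toy reference-flow graph follows. The node names echo the example of Sec. 4.7.3, but the edge set here is illustrative and is not the network of Fig. 9.

    from collections import defaultdict
    from functools import lru_cache

    # Toy reference-flow DAG: an edge (a, b) means knowledge flows a -> b,
    # i.e. paper b cites paper a.
    edges = [("C13", "C8"), ("C11", "C8"), ("C8", "C6"), ("C6", "I"),
             ("C14", "C9"), ("C11", "C9"), ("C9", "C7"), ("C7", "I"),
             ("I", "C12"), ("C12", "C2"), ("I", "C5")]

    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)
        nodes.update((a, b))

    @lru_cache(maxsize=None)
    def paths_from_sources(v):
        """Number of distinct source-to-v paths (a source has no predecessors)."""
        return 1 if not pred[v] else sum(paths_from_sources(u) for u in pred[v])

    @lru_cache(maxsize=None)
    def paths_to_sinks(v):
        """Number of distinct v-to-sink paths (a sink has no successors)."""
        return 1 if not succ[v] else sum(paths_to_sinks(w) for w in succ[v])

    # SPC of a link (u, v): every source-to-sink path through it is counted once.
    spc = {(u, v): paths_from_sources(u) * paths_to_sinks(v) for u, v in edges}

    # Key-route search: start from the link with the highest SPC, push ahead to a
    # sink and go back to a source, greedily following maximum-SPC links.
    start = max(spc, key=spc.get)
    route = list(start)
    while succ[route[-1]]:
        route.append(max(succ[route[-1]], key=lambda w: spc[(route[-1], w)]))
    while pred[route[0]]:
        route.insert(0, max(pred[route[0]], key=lambda u: spc[(u, route[0])]))

    print({f"{u}->{v}": c for (u, v), c in spc.items()})
    print("key route:", " -> ".join(route))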
4.7.3. Illustrative example of key route identification
Let us consider the simple citation network depicted in Fig. 9. It has 4 sources (C1, C11, C13, C14) and 5 sinks (C2, C3, C4, C5, C10), with several alternative paths from the sources to the sinks. Assuming one exhaustively searches all paths from every source to every sink, the SPC of each link is the total number of times the link is traversed. For example, link C12-C2 has an SPC value of 5 because the paths C11-C8-C6-I-C12-C2, C1-C12-C2, C13-C8-C6-I-C12-C2, C14-C9-C7-I-C12-C2 and C11-C9-C7-I-C12-C2 pass through it. Since link I-C5 is part of four distinct paths, C11-C8-C6-I-C5, C13-C8-C6-I-C5, C9-C7-I-C5 and C14-C9-C7-I-C5, its SPC value is 4. The link weights obtained as SPC are shown in Fig. 10. Links C9-C7 and C8-C6 have the highest SPC value of 10 and are therefore chosen first. We then search backward from their begin nodes until a source is hit and forward from the end nodes C7 and C6 until a sink is hit, which yields C13-C8-C6-I-C12-C2, C11-C8-C6-I-C12-C2, C11-C9-C7-I-C12-C2 and C14-C9-C7-I-C12-C2 as the global key routes. The sum of the SPC values over these key-route paths is 32, the largest among all possible paths, as shown in Fig. 11. In total there are 10 papers on the key routes, including the paper of interest I. After retaining only the unique papers, we have C2, C6, C7, C8, C9, C11, C12, C13, C14 and I as the final candidate papers. Recall that we had 25 papers initially in the network; after applying the C-score, 15 papers (including I) were shortlisted, and after applying the key route we end up with 10 significant papers, including I (Fig. 11).

Figure 11: Key-route identification using main path analysis

4.7.4. Complexity analysis
In main path analysis, we traverse all outgoing vertices of a given source vertex si and then, repeatedly for each of them, their own outgoing vertices until a sink node sj is reached. In this paper, the average numbers of nodes and edges at this stage were around 503 and 872 respectively, with an average degree of 1.73, which clearly indicates the sparseness of the citation network. The diameter of the network was 6.4, and the maximum numbers of outgoing and incoming vertices of a node were around 23 and 57 respectively. We need to visit all possible paths from a given source vertex to a sink vertex. As mentioned in Sec. 4.6.5, the diameter of the citation network is d; if the maximum out-degree of a vertex is k3, then for a given source vertex si the time complexity is O(k3^(d-1)). Assuming there are s1 source vertices and s2 sink vertices in the citation graph, the total time complexity, and hence the overall computational complexity of main path analysis, is at most O(s1 k3^(d-1) s2).
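To make this bound concrete, the exhaustive source-to-sink enumeration whose cost it describes can be sketched as follows; the toy graph is illustrative only (the SPC counts themselves can be obtained without full enumeration, as in the earlier sketch).

    from collections import defaultdict

    # Toy reference-flow DAG: succ[a] lists the nodes that knowledge flows to from a.
    succ = defaultdict(list, {
        "C13": ["C8"], "C11": ["C8", "C9"], "C14": ["C9"],
        "C8": ["C6"], "C9": ["C7"], "C6": ["I"], "C7": ["I"],
        "I": ["C12", "C5"], "C12": ["C2"],
    })

    def all_paths(node, path=()):
        """Enumerate every path from `node` down to a sink (a node with no successors)."""
        path = path + (node,)
        if not succ[node]:
            yield path
            return
        for nxt in succ[node]:
            yield from all_paths(nxt, path)

    sources = [n for n in list(succ) if not any(n in v for v in succ.values())]
    paths = [p for s in sources for p in all_paths(s)]
    print(len(paths), "source-to-sink paths, e.g.", " -> ".join(paths[0]))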
4.8. F. Result Fusion
To find the final ranking of venues, the following steps are followed sequentially:
(i) merging Set-I and Set-II papers for abstract similarity using LDA and NMF techniques;
(ii) extraction of unique venues and their similarity computation;
(iii) normalization of similarity scores and their fusion.

4.8.1. Merging Set-I and Set-II papers for abstract similarity using LDA and NMF techniques
Title similarity and keyword similarity are computed for Set-II papers as well using Algo. 1 and Algo. 2, and keyword matching is performed by Eqn. 10. The top t2 similar papers are chosen based on cumulative scores. We make three assumptions regarding the inclusion of Set-II papers for abstract similarity.
(a) There may be a few papers that have no citations (Set-II), yet such papers may be published at reputed venues.
(b) The title and keywords of the seed paper may match some papers in Set-II, so the seed paper may be accepted at venues similar to those of the Set-II papers.
(c) Generally, papers published in reputed venues receive a high number of citations.
So, along with the shortlisted key-route papers (t1), we add the shortlisted Set-II papers (t2) to make a combined list of shortlisted papers (t1 + t2). For each paper in the list, we compute its abstract similarity with the seed paper using LDA and NMF independently. We use LDA and NMF because these techniques capture topics rather than exact terms: once sufficient care has been taken over keyword matching, and some papers have qualified on other criteria, topic-level matching of abstracts is appropriate. We use both LDA and NMF separately since LDA mainly considers the terms of a document, independently of their presence in other documents, to identify topics [78], whereas NMF tries to capture sets of words occurring together in a topic using tf-idf vectors [26]. We assume the two methods are complementary and provide two different rankings of a given paper. The number of topics (vector dimension) considered here is 100 during topic extraction. Thus we obtain two lists of papers, sorted in decreasing order of the LDA-based and NMF-based similarity scores respectively. The complete fusion procedure is summarized in Algo. 3; an illustrative sketch in code follows it.

Algorithm 3: Fusion-based final venues ranking
    Input: shortlisted key-route papers (t1) and shortlisted Set-II papers (t2) for abstract similarity
    Output: top-N recommended list of venues
    Initialization: let m be the seed paper and T = t1 + t2 the set of candidate papers
    for i <- 0 to |T| - 1 do
        compute abstract similarity of paper i with m using LDA;  Li <- similarity score
        compute abstract similarity of paper i with m using NMF;  Ni <- similarity score
    end
    Sl <- papers ordered by decreasing LDA-based similarity score
    Vl <- set of unique venues based on the top-scoring papers in Sl
          (and similarly Sn, Vn from the NMF-based scores)
    for k <- 0 to |Vl| - 1 do
        lk <- LDA-based similarity score of venue vk
        nk <- NMF-based similarity score of venue vk
    end
    Normalize(lk) <- (lk - min(lk)) / (max(lk) - min(lk))
    Normalize(nk) <- (nk - min(nk)) / (max(nk) - min(nk))
    for k <- 0 to |Vl| - 1 do
        N(vk) <- Normalize(lk) + Normalize(nk)
        CombMNZ(vk) <- N(vk) x |Ns > 0|
    end
    where N(vk) is the combined normalized score of venue vk over the LDA and NMF result sets,
    and |Ns| is the number of non-zero normalized scores given to vk by either result set
    Sort venues in decreasing order of CombMNZ(vk)
    Prepare the final list of top-N venue recommendations
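The following is a minimal, self-contained sketch of this fusion step using scikit-learn's LDA and NMF implementations on toy abstracts. The abstracts and venue names are invented for illustration; in the real system the candidates are the shortlisted t1 + t2 papers and the topic dimension is 100.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation, NMF
    from sklearn.metrics.pairwise import cosine_similarity

    seed_abstract = "scholarly venue recommendation using citation networks and topic models"
    candidates = {  # candidate paper -> (abstract, venue); all hypothetical
        "p1": ("topic models for recommending publication venues", "Journal A"),
        "p2": ("citation network analysis of scholarly papers",    "Conference B"),
        "p3": ("deep learning for image segmentation",             "Journal C"),
    }
    texts = [seed_abstract] + [abstract for abstract, _ in candidates.values()]

    def topic_similarities(model, vectorizer):
        """Project abstracts into a topic space; return seed-vs-candidate cosine similarities."""
        topics = model.fit_transform(vectorizer.fit_transform(texts))
        return cosine_similarity(topics[:1], topics[1:])[0]

    lda_sim = topic_similarities(LatentDirichletAllocation(n_components=2, random_state=0),
                                 CountVectorizer(stop_words="english"))
    nmf_sim = topic_similarities(NMF(n_components=2, init="nndsvda", random_state=0),
                                 TfidfVectorizer(stop_words="english"))

    def venue_scores(similarities):
        """Each venue keeps the highest similarity among its papers (cf. Sec. 4.8.2)."""
        scores = {}
        for (_, venue), s in zip(candidates.values(), similarities):
            scores[venue] = max(scores.get(venue, 0.0), float(s))
        return scores

    def min_max(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {v: (s - lo) / (hi - lo) if hi > lo else 0.0 for v, s in scores.items()}

    lda_venues = min_max(venue_scores(lda_sim))
    nmf_venues = min_max(venue_scores(nmf_sim))

    # CombMNZ: sum of normalized scores times the number of lists giving a non-zero score.
    comb_mnz = {}
    for v in set(lda_venues) | set(nmf_venues):
        parts = [lda_venues.get(v, 0.0), nmf_venues.get(v, 0.0)]
        comb_mnz[v] = sum(parts) * sum(1 for p in parts if p > 0)

    for venue, score in sorted(comb_mnz.items(), key=lambda kv: kv[1], reverse=True):
        print(venue, round(score, 3))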
4.8.2. Extraction of unique venues and their similarity computation
We extract the venues corresponding to the papers in the two ranked lists and form two lists of unique venues. Venues are ordered by the top-scoring papers published there, and each venue vk is assigned the highest similarity score among the papers published at vk. Hence we have two ordered lists of venues; the entries in the two lists are the same set of venues, possibly at different ranks and with different similarity scores.
4.8.3. Normalization of similarity scores and their fusion
The similarity scores depend on the technique (LDA or NMF) and hence are not readily comparable. To compare them, we need to normalize the scores and then apply a fusion technique to obtain a single ordered list of venues. Data fusion is a widely investigated area in the recommendation community. We use score-based fusion rather than rank-based fusion, as it reduces the number of ties. Among the score-based fusion techniques, such as CombSum, CombMNZ, and weighted combination [79], we use CombMNZ [80]. To run CombMNZ, the similarity scores of the two lists must be normalized so that they lie in a common range. Of the normalization strategies proposed in the literature, we select the one used by Lee et al. [81], which is the most commonly used for comparison and has been defined as "standard normalization" [82]. We apply CombMNZ_v on the normalized venue scores and finally recommend the top-N venues based on the CombMNZ_v scores, where N (usually N ≠ t1, t2) is user-specified. The complete algorithm is provided in Algo. 3.

5. Experiments
We conduct an extensive set of experiments. Below we outline the experimental dataset, evaluation strategy, evaluation metrics, experimental setting, parameter tuning, and the comparable methods. All experiments are conducted on a 64-bit, 2.4 GHz Intel Core i5 system with 8 GB of memory.

5.1. Dataset Used
We use the Microsoft Academic Graph (MAG) dataset [22, 23]. It is a publicly available, heterogeneous dataset of scholarly publications and the most extensive open citation dataset, currently updated on a weekly basis. The dataset consists of various types of entities: publications, institutions (affiliations), authors, fields of study (FOS), venues (journals and conferences), events (specific conference instances), and the relations among these entities. It also contains metadata of the papers, such as title, DOI, and year of publication, and has excellent coverage over various domains. All the FOS are organized hierarchically into four levels (level 0 to level 3, level 3 being of the highest granularity). For our study, we use the version of MAG published on 5 February 2016. We process the dataset by retaining only papers related to the FOS of "Computer Science (CS)" and "Biology (BIO)" occurring at level 0, and for both we collect only the papers published in 1982-2016 (35 years of data). The major attributes of the dataset used are specified in Table 3. In CS there are 13,696 fields of study at level 3 (e.g. "COBOL"), 685 at level 2 (e.g. "Low-level programming language"), 35 at level 1 (e.g. "programming language") and 1 at level 0 (e.g. "Computer Science"); similarly, in BIO there are 15 FOS at level 1. Related fields carry a confidence score indicating their degree of relatedness. The dataset does not include the full text or abstracts of the publications, so we used a web-based crawler to extract the required abstracts from the Web, using the available title, year, URL, and DOI, before applying abstract similarity.

Table 3: Statistics of both CS and BIO papers (subset of MAG)
    Type               No. of Records (CS)    No. of Records (BIO)
    Papers             15,641,658             14,785,486
    Total FOS          14,417                 10,522
    FOS in Level 0     1 (CS)                 1 (Biology)
    FOS in Level 1     35                     15
    FOS in Level 2     685                    523
    FOS in Level 3     13,696                 9,976

5.2. Evaluation strategy
We adopt the following two kinds of evaluation to measure the performance of DISCOVER against other state-of-the-art methods.
(a) Coarse-level or offline evaluation: as the name suggests, it provides a raw, quick notion of how the proposed DISCOVER fares vis-a-vis other systems. We focus on prediction accuracy, i.e., whether the original publication venue of a test paper is predicted, and if so, at what rank within the top N recommendations. Accuracy, MRR and F-measure_macro are used as evaluation metrics (detailed below). We call this scenario offline because a system can be evaluated this way only when annotated test data are available.
(b) Fine-level or online evaluation: this scenario is more realistic (hence "online"), since a researcher needs more than one venue recommendation for the paper she is writing and wants to communicate. Here we go deeper and assess the relevance, usefulness, and quality of the recommended results. The system recommends an ordered list of venues, which are assessed by experts in terms of graded relevance (Eqn. 29). Precision, nDCG, average venue quality, diversity and stability are used as evaluation metrics.
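To make the data preparation of Secs. 5.1 and 5.4 concrete, the following is a toy sketch of the filtering and chronological split; the column names are assumptions for illustration and do not reflect the actual MAG schema.

    import pandas as pd

    # Illustrative MAG-like tables (hypothetical columns and values).
    papers = pd.DataFrame({
        "paper_id": [1, 2, 3],
        "title":    ["A", "B", "C"],
        "venue":    ["J1", "C1", None],
        "year":     [1985, 2013, 2010],
    })
    paper_fos = pd.DataFrame({
        "paper_id":   [1, 2, 3],
        "fos_level0": ["Computer Science", "Computer Science", "Biology"],
    })

    # Keep CS papers from 1982-2016 that have venue information.
    cs = papers.merge(paper_fos, on="paper_id")
    cs = cs[(cs["fos_level0"] == "Computer Science")
            & cs["venue"].notna()
            & cs["year"].between(1982, 2016)]

    # Chronological split: papers up to 2011 for training, 2012 onwards for testing.
    train, test = cs[cs["year"] < 2012], cs[cs["year"] >= 2012]
    print(len(train), len(test))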
5.3. Evaluation metrics
We employ the following eight metrics, which we find suitable to capture the necessary features for both types of evaluation.

(a) Accuracy@N: the ratio of the number of times a system correctly predicts the original publication venue within some fixed top N recommendations to the total number of test papers [29, 34]. Here we consider N = 3, 6, 9, 12 and 15.

$$ Accuracy@N = \frac{\#\ \text{times the system predicts the original venue within top}\ N}{\text{Total number of test papers}} \qquad (16) $$

If N is small and/or the system is poor, it may fail to predict the original venue for a given paper, so the measure is taken over a set of papers; the more papers, the better it reflects the potential of the system. The score is a real number between 0 and 1.

(b) Mean Reciprocal Rank (MRR): the arithmetic mean of the reciprocal rank (RR), which is the inverse of the first rank at which the correct original venue is recommended in the ranked result [83].

$$ MRR = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_{rel_i}} \qquad (17) $$

where rank_{rel_i} denotes the rank position of the first relevant document for the i-th query in a query set Q. Although accuracy shows how often a system predicts correctly within a given rank, it does not consider at what rank; MRR plugs this gap and rewards systems that predict at early ranks.

(c) Precision: the fraction of retrieved items that are relevant; in our context, the fraction of recommended venues that are relevant, as shown in Eqn. 18.

$$ Precision = \frac{|\text{relevant venues} \cap \text{recommended venues}|}{\text{total recommended venues}} \qquad (18) $$

Precision@k is the precision when k venues are recommended, i.e.,

$$ Precision@k = \frac{|\text{relevant venues} \cap \text{recommended venues}|}{k} \qquad (19) $$

(d) F-measure_macro (F1): the unweighted harmonic mean of precision and recall, where we consider macro-averages of both. The macro-average is the average of the same measure calculated over all classes and treats all classes equally. For an individual class Ci (venue), with within-class true positives tp_i, true negatives tn_i, false positives fp_i and false negatives fn_i [84], the necessary metrics are

$$ Precision_{macro} = \frac{1}{N}\sum_{i=1}^{N}\frac{tp_i}{tp_i+fp_i} \qquad (20) $$

$$ Recall_{macro} = \frac{1}{N}\sum_{i=1}^{N}\frac{tp_i}{tp_i+fn_i} \qquad (21) $$

$$ F\text{-}measure_{macro} = \frac{2\, Precision_{macro} \times Recall_{macro}}{Precision_{macro}+Recall_{macro}} \qquad (22) $$

(e) Normalized discounted cumulative gain (nDCG): the ratio of the discounted system gain to the discounted ideal gain accumulated at a particular rank p, where the gain at rank p is the sum of relevance values from rank 1 to rank p [83]. The relevance value in our system (rel^s_j) is a score (0, 1 or 2) assigned by a researcher to the venue at position j. The ideal vector is constructed hypothetically by ordering all relevance scores (rel^i_j) in decreasing order, ensuring the highest possible gain at every rank.

$$ DCG^{s}_{p} = rel^{s}_{1} + \sum_{j=2}^{p}\frac{rel^{s}_{j}}{\log_2(j)} \qquad (23) $$

$$ IDCG_{p} = rel^{i}_{1} + \sum_{j=2}^{p}\frac{rel^{i}_{j}}{\log_2(j)} \qquad (24) $$

$$ nDCG_{p} = \frac{DCG_{p}}{IDCG_{p}} \qquad (25) $$

(f) Diversity (D): the average dissimilarity (opposite of similarity) between all pairs of items in a result set [85, 86].

$$ D = 2 \times \frac{\sum_{i=1}^{N}\sum_{j=1}^{N}\bigl(1 - Similarity(v_i, v_j)\bigr)}{N(N-1)} \qquad (26) $$

where N is the length of the recommendation list, v_i and v_j are venues appearing in the list, and Similarity(v_i, v_j) denotes the content (abstract, keyword) similarity between venues v_i and v_j.

(g) Stability: a recommender system is stable if its predictions do not change strongly over a short period of time [87]. It is measured by the mean absolute shift (MAS), designed to capture the internal consistency among predictions made by a given recommendation algorithm [3]. It is defined through a set of training data R1 and a set of predictions (rankings of the original venue) of seed papers, P1; after an interval of time (the addition of new data to the training data), the recommender makes a new set of predictions, P2, and

$$ Stability = MAS = \frac{1}{|P_2|}\sum_{(u,i)\in P_2}\bigl|P_2(u,i)-P_1(u,i)\bigr| \qquad (27) $$

where P1 and P2 are the predictions made in phase 1 and phase 2, respectively.

(h) Average-Venue Quality (Ave-quality): evaluates the quality of the venues recommended by a system based on Google's h5-index [14].

$$ \text{Average-venue quality} = \frac{\sum_{v\in V} H5_v}{|V|} \qquad (28) $$

where V is the set of recommended venues and H5_v is the h5-index of venue v; the higher the Ave-quality, the better the recommendation.

Precision captures the overall performance of the system in terms of how many relevant venues it can recommend, a requirement a prospective researcher often has before sending her manuscript. However, precision only considers whether a venue is relevant or not; in reality the relevance of a venue can be graded, e.g., exactly relevant, partially or moderately relevant, or not relevant. nDCG takes this subtlety into consideration and measures system performance with respect to an ideal system that ranks the recommendations in decreasing order of relevance. Both metrics are bounded between 0 and 1 and are used for the online evaluation.
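As a concrete illustration of the graded-relevance metrics above, the following minimal sketch computes nDCG@k (Eqns. 23-25) and the diversity of a recommendation list (Eqn. 26). The relevance grades and venue similarities are invented for the example.

    import math

    def dcg(relevances):
        """Discounted cumulative gain with the log2 discount of Eqns. 23-24."""
        return sum(rel if j == 1 else rel / math.log2(j)
                   for j, rel in enumerate(relevances, start=1))

    def ndcg_at_k(relevances, k):
        """nDCG@k (Eqn. 25): system DCG over the DCG of an ideally ordered list."""
        ideal = dcg(sorted(relevances, reverse=True)[:k])
        return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

    def diversity(venues, similarity):
        """Average pairwise dissimilarity of a recommendation list (Eqn. 26)."""
        n = len(venues)
        total = sum(1.0 - similarity(venues[i], venues[j])
                    for i in range(n) for j in range(n) if i != j)
        return total / (n * (n - 1))

    # Illustrative graded relevance (2 = perfect, 1 = partial, 0 = not relevant; Eqn. 29).
    graded = [2, 1, 2, 0, 1]
    print(round(ndcg_at_k(graded, 5), 3))

    # Hypothetical content similarity between venues (would come from abstracts/keywords).
    toy_sim = {("v1", "v2"): 0.8, ("v1", "v3"): 0.2, ("v2", "v3"): 0.3}

    def sim(a, b):
        return 1.0 if a == b else toy_sim.get((a, b), toy_sim.get((b, a), 0.0))

    print(round(diversity(["v1", "v2", "v3"], sim), 3))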
5.4. Experimental setting
Initially, we consider papers from CS. After removing papers with no venue details, 13,402,547 papers remain in CS; similarly, after preprocessing the BIO papers, 12,848,227 remain. All papers published from 1982 to 2011 are used as the training set, and the rest (papers dated 2012 or later) as the testing set, as given in Table 4.

Table 4: Statistics of training and testing dataset in CS and BIO
    FOS                      Pre-processed Dataset    Training Dataset    Testing Dataset
    Computer Science (CS)    13,402,547               10,424,960          2,977,587
    Biology (BIO)            12,848,227               9,961,893           2,886,334

5.4.1. Preparation of test dataset
Due to operational constraints, only 12 sub-domains of CS and 6 sub-domains of BIO are selected for the testing dataset. A total of 72 seed papers are chosen from the 12 CS sub-domains (6 from each): Information Retrieval (IR), Image Processing (IP), Security (SC), Wireless Sensor Network (WSN), Machine Learning (ML), Software Engineering (SE), Computer Vision (CV), Artificial Intelligence (AI), Data Mining (DM), Natural Language Processing (NLP), Parallel and Distributed Systems (PDS) and Multimedia (MM). Similarly, a total of 48 seed papers are chosen from the 6 BIO sub-domains (8 from each): Computational Biology (CB), Anatomy (AN), Immunology (IM), Toxicology (TX), Biochemistry (BI) and Paleontology (PL). Seed papers are chosen keeping in mind the cold-start issues for new venues and new researchers. We consider 3 categories of venues and 3 categories of researchers, based on the venue count (vc, the number of papers published at a given venue) and the publication count (pc, the number of publications of a researcher) [14, 20], giving the following six categories:
(i) Category 1: 2 ≤ vc < 8
(ii) Category 2: 8 ≤ vc < 15
(iii) Category 3: 15 ≤ vc
(iv) Category 4: 2 ≤ pc < 8
(v) Category 5: 8 ≤ pc < 15
(vi) Category 6: 15 ≤ pc
It is ensured that each category is well represented in the seed papers for both the CS and BIO data.

5.4.2. Procedure of online evaluation
For this evaluation we did not have ready annotations, but we needed them. The annotations, or relevance assessments, were collected from volunteers through crowd-sourcing on a best-effort basis. For CS, 40 researchers with expertise in the mentioned sub-domains were provided with the input and output of our recommender system, where 15 venues are recommended for each paper. Of the 40 researchers, 9 evaluated 3 papers each, 14 evaluated 2 each, and the remaining 17 evaluated one each. Similarly, 25 researchers volunteered for BIO: 7 evaluated 3 papers each, 9 evaluated 2 each, and the remaining 9 evaluated one each. All experts were identified from academia with a minimum of 3 years of research experience; most held a Ph.D., except for a few research students and research assistants pursuing a Ph.D. with a bachelor's or master's degree in science or technology. The experts were chosen so that their active areas of research closely match the topics of the seed papers. Among the 65 researchers there were 12 professors, 9 associate professors, 24 assistant professors, 13 senior research students, and 7 research assistants, all from reputed institutions such as Indian Institute of Technology Kharagpur, Indian Institute of Technology Roorkee, Indian Institute of Technology (BHU) Varanasi, Central University Hyderabad, Manipal University, and Banaras Hindu University (BHU). The professors were aged 48-55, the associate professors 43-47, the assistant professors 36-41, the senior research students 28-31, and the research assistants 29-33. The gender distribution was 44 male and 21 female experts. The experts examined the titles, abstracts, authors, years of publication, and recommended venues of the papers, and each expert assigns a relevance value (r) to every recommended venue according to how well the scope of the venue matches the topic of the seed paper:

$$ \text{Relevance } (r) = \begin{cases} 2 & \text{perfectly matching} \\ 1 & \text{partially matching} \\ 0 & \text{otherwise} \end{cases} \qquad (29) $$
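The offline (coarse-level) protocol of Sec. 5.2 then reduces to recording, for each test paper, the rank at which its actual publication venue appears among the recommendations. A minimal sketch of Accuracy@N (Eqn. 16) and MRR (Eqn. 17) follows; the rank data are illustrative.

    def accuracy_at_n(ranks, n):
        """Fraction of test papers whose original venue appears within the top n (Eqn. 16)."""
        hits = sum(1 for r in ranks if r is not None and r <= n)
        return hits / len(ranks)

    def mean_reciprocal_rank(ranks):
        """Arithmetic mean of the reciprocal rank of the original venue (Eqn. 17)."""
        return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

    # Illustrative outcome for five test papers: rank of the true venue in the
    # recommendation list, or None when it was not recommended at all.
    ranks = [1, 4, None, 2, 9]
    for n in (3, 6, 9, 12, 15):
        print(f"Accuracy@{n} = {accuracy_at_n(ranks, n):.2f}")
    print(f"MRR = {mean_reciprocal_rank(ranks):.3f}")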
However, as precision is defined only for binary relevance, during precision computation relevance grade 2 alone is considered relevant, while grades 1 and 0 are treated as non-relevant.

5.5. Parameter tuning and optimization
DISCOVER has a few essential parameters in its processing pipeline:
(i) the number of top-k papers for identifying I (refer Sec. 4.5.1);
(ii) the number of papers without citation history (t2) (refer Fig. 2 and Sec. 4.4);
(iii) the vector dimension for LDA (refer Sec. 4.8).

5.5.1. Impact of top-k papers on selection of I
To identify the paper of interest (I), one for each component, the number of papers extracted after computing title and keyword similarity is an important parameter of our system. Initially we test with the top 5 papers based on the cumulative score (of title and keyword matching), and then with 10, 15, 20, 25 and 50. We observe little change in the selection of I beyond the top 10 papers; hence k = 10 is used for abstract similarity.

5.5.2. Impact of t2 on recommendation order
We also experimentally test the effect of the number of papers (t2) selected from Set-II for abstract similarity, varying t2 from 5 to 50. The upper limit of 50 is chosen to offer Set-II the same opportunity as Set-I (on average, main path analysis yields 45-85 papers). Since no major change in the recommended order is observed beyond 15 papers, t2 is set to 15.

5.5.3. Impact of vector dimension on final recommendation
To find an appropriate dimension (number of topics) for LDA, we tried the values {10, 50, 100, 200}. The model performs best with a vector dimension of 100, second-best with 200, and worst with 10. Although 100 may not be 'the best' value, at the coarse level it is the optimized number of dimensions.

In order to address the broad research questions (RQs) discussed in Sec. 1, we specifically examine the following sub-queries pertaining to them in the evaluation.
SQ1: How effective is DISCOVER in comparison to other state-of-the-art methods and other freely available online services, as mentioned in Sec. 5.6?
SQ2: How good is the quality of the venues recommended by DISCOVER in comparison to other state-of-the-art methods, in terms of the H5-index of the recommended venues as described in Sec. 5.3?
SQ3: How does DISCOVER handle cold-start issues for new researchers and new venues, as well as issues like data sparsity, scalability, diversity, and stability, as discussed in Sec. 2.4.3?

5.6. Baseline methods
The performance of DISCOVER is compared with the following eight state-of-the-art methods and also with the freely available online services EJF and SJS.

5.6.1. Comparison with existing state-of-the-art methods
(a) Collaborative filtering models (CF): a memory-based implementation of collaborative filtering over a given paper-venue matrix. The underlying assumption is that a paper has a high probability of being published in venues where other similar papers have been published [29].
(b) Personal venue rating-based collaborative filtering models (PVR): based on implicit ratings given to individual venues, created from the references of a researcher's publications and from the papers that cited the researcher's past publications [17].
(c) Content-based filtering models (CBF): the main idea is to compute the similarity between researchers and venues. Here we take the researcher's publications and the content of all publications at each venue as feature vectors computed by an LDA model [34].
(d) Friend based model (FB): recommends venues based on the number of neighbours, such as a researcher's co-authors and co-authors' co-authors; if a venue is attached to many neighbours, it is recommended [88].
(e) Co-authorship network-based models (CN): creates a social network for each author and recommends venues based on the reputation of the author's social network and other information such as venue name, venue sub-domain, and number of publications [40].
(f) Random walk with re-start models (RWR): runs a random-walk-with-restart model on a co-publication network with two types of nodes, authors and venues. The model is similar to AVER, but in RWR the probability of skipping to the next neighbour node is uniform [20].
(g) Hybrid approach (CF+CBF): We have mapped the citation web into a collaborative filtering rating matrix in such a way that a paper would represent a
1380
1385
1390
1395
1400
1405
Table 5: Accuracy@k and MRR results comparison of FOS "CS"
    Approach     Acc@3    Acc@6    Acc@9    Acc@12   Acc@15   MRR
    FB           0.013    0.027    0.069    0.097    0.180    0.022
    CF           0.027    0.041    0.097    0.138    0.208    0.025
    CN           0.027    0.069    0.097    0.152    0.236    0.029
    CBF          0.041    0.097    0.166    0.222    0.291    0.038
    CF+CBF       0.053    0.102    0.179    0.237    0.319    0.047
    RWR          0.055    0.111    0.180    0.236    0.347    0.058
    PVR          0.083    0.125    0.208    0.277    0.388    0.062
    PAVE         0.125+   0.194+   0.250+   0.291+   0.416+   0.093+
    DISCOVER     0.222*   0.347*   0.472*   0.541*   0.708*   0.167*
pro of
(h) Personalized academic venue recommendation models (PAVE): It is similar to the popular random walk model other than the introduction of transfer matrix with bias. The probability of skipping to the next neighbor node is biased using co-publication frequency, relation weight, and researcher’s academic level in PAVE [14].
Table 6: Accuracy@k and MRR results comparison of FOS "BIO"
    Approach     Acc@3    Acc@6    Acc@9    Acc@12   Acc@15   MRR
    FB           0.000    0.020    0.062    0.104    0.187    0.021
    CF           0.000    0.044    0.104    0.166    0.229    0.028
    CN           0.020    0.062    0.125    0.187    0.250    0.030
    CBF          0.020    0.083    0.125    0.187    0.270    0.032
    CF+CBF       0.032    0.097    0.136    0.203    0.291    0.044
    RWR          0.041    0.083    0.145    0.229    0.312    0.036
    PVR          0.041    0.104    0.187    0.250    0.354    0.039
    PAVE         0.056+   0.145+   0.229+   0.354+   0.437+   0.067+
    DISCOVER     0.128*   0.223*   0.328*   0.491*   0.697*   0.163*
Among these eight methods, CF and PVR are based on the collaborative filtering approach; CN and FB are based on the co-authorship network; CBF is a content-based filtering method; PAVE and RWR are based on a random walk with restart; and CF+CBF is an integrated framework of both collaborative filtering and content-based filtering. During the assessment, statistically significant results and the second-best performer are marked by the '*' and '+' symbols at each position.
re-
1375
Approach
‘*’ denote statistically significant results over the second best (‘+’)
5.6.2. Comparison with other freely available on-line services
We also compare our results against EJF and SJS in terms of the metrics discussed in Sec. 5.3.
(a) EJF: the system uses NLP and Okapi BM25 to recommend journals based on the title and abstract of the seed paper [38].
(b) SJS: also a freely available online service, which provides journal recommendations based on the input title, abstract and field of study of the seed paper.
For comparison, recommendations from DISCOVER are restricted to Elsevier and Springer journals only.

6. Results and discussions
The performance of DISCOVER against the existing state-of-the-art methods and the freely available online services EJF and SJS is reported below. For clarity and easy understanding, we provide the results and discussion in two steps, as given below. We also conduct paired-samples t-tests on overall precision, nDCG, accuracy, and MRR for both CS and BIO between DISCOVER and the second-best performers; only p-values less than 0.05 are considered statistically significant (α = 0.05).

6.1. Offline or Coarse-level evaluation
Venue-prediction accuracy of DISCOVER is measured on both the CS and BIO domains at different recommendation ranks (@3, @6, @9, @12, and @15) in Table 5 and Table 6 respectively. The prediction accuracy of DISCOVER is the best among all methods at all levels in both CS and BIO, and the scores are statistically significant over the second-best scores.
Our approach is biased neither in favour of CS nor against BIO, and the two collections are comparable (CS has 15,641,658 papers, BIO 14,785,486). However, at early positions the system performs worse for BIO, possibly for the following reason: in the BIO domain more papers are shortlisted after the abstract-similarity calculation (70-110, compared with 60-100 in CS), leading to more journals in the candidate set of recommendations (Table 16). BIO journals are also found to have a larger scope, covering diverse topics, so it becomes difficult for DISCOVER to predict the original journal at early ranks, as many BIO journals have overlapping scopes. The phenomenon is substantially less prominent in the CS dataset, possibly leading to better performance there.
FB and CF exhibit the worst performances and are unable to predict at all at the 3-recommendation level. As far as MRR is concerned, DISCOVER is the best, and for BIO its score is more than double that of its nearest competitor.
lP
1370
Table 5: Accuracy@k and MRR results comparison of FOS “CS”
urn a
1365
user and a citation would represent an item. This method used an item-based collaborative filtering approach to identify a set of candidate papers in a given paper-citation matrix. Later on, we apply LDA on all extracted abstract, title to compute the similarity among seed paper and candidate papers. Finally, the set of papers having high content similarity are identified, and their respective venues are recommended. We also tried to use researcher-paper citation relationship to populate the rating matrix and also tried other ways of combining CBF and CF, but the way used in this method performs the best.
Jo
1360
20
Figure 12: Sub-domain wise precision@k calculation (CS)
1450
Table 7: F-measure (F1) analysis of FOS "CS"
    Approach     F1@3     F1@6     F1@9     F1@12    F1@15
    FB           0.004    0.021    0.032    0.029    0.024
    CF           0.007    0.029    0.037    0.034    0.031
    CN           0.010    0.041    0.068    0.056    0.052
    CBF          0.013    0.049    0.083    0.079    0.075
    CF+CBF       0.016    0.053    0.098    0.093    0.088
    RWR          0.018    0.058    0.103    0.109    0.102
    PVR          0.023    0.064    0.194    0.176    0.153
    PAVE         0.059+   0.119+   0.243+   0.213+   0.196+
    DISCOVER     0.147*   0.197*   0.363*   0.341*   0.335*
1455
lP
Approach
CS and BIO. F1 scores are generally seen to increase with rank up to a certain point (around 9-12) and drop after that. This is possible since precision and recall both increase till that point until the original venues are retrieved, causing an increase in the F1 score. However, with further increase in ranks, precision drops sharply without much increase in recall leading to an overall drop in F1 scores. Here also DISCOVER outperforms in terms of F1 measure in comparison to other state-of-the-art methods.
‘*’ denote statistically significant results over the second best (‘+’)
6.2. Online or Finer-level evaluation
urn a
Table 8: F-measure (F1) analysis of FOS "BIO"
    Approach     F1@3     F1@6     F1@9     F1@12    F1@15
    FB           0.000    0.018    0.031    0.030    0.027
    CF           0.000    0.026    0.039    0.038    0.034
    CN           0.007    0.038    0.071    0.063    0.058
    CBF          0.009    0.041    0.075    0.068    0.067
    CF+CBF       0.012    0.046    0.078    0.064    0.059
    RWR          0.014    0.048    0.097    0.093    0.091
    PVR          0.018    0.059    0.183    0.168    0.147
    PAVE         0.053+   0.108+   0.237+   0.218+   0.193+
    DISCOVER     0.119*   0.186*   0.327*   0.314*   0.308*
1460
1465
1440
1445
Jo
‘*’ denote statistically significant results over the second best (‘+’)
We have also investigated the efficacy of the proposed1470 model DISCOVER in terms of F − measuremacro (F1 ) on both CS and BIO domains, as defined in Eqn. 22. Note that here, precision is considered only for the original venues, i.e., non-zero precision comes only if a system within top-15 recommendations recommends the original1475 venue. In both CS and BIO domains, DISCOVER outperforms other state-of-the-art methods at all ranks (Table 7 and Table 8). Similarly, the second-best performance is exhibited by PAVE, whereas FB performs the worst in both 21
The performances of different systems along with DISCOVER in individual sub-domains of CS (12 sub-domains) and BIO (6 sub-domains) according to different metrics are discussed below. 6.2.1. Precision@k The precision scores of DISCOVER and other state-ofthe-art methods of 12 sub-domains under FOS CS are shown in Fig. 12 and Fig. 13 and for 6 sub-domains of BIO in Fig. 14. DISCOVER outperforms other state-ofthe-art methods consistently according to precision@k in 9 sub-domains, namely, CV, ML, IR, NLP, IP, WSN, AI, PDS, and SC. It exhibits an average performance in domains SE, MM, and DM. In the case of BIO, DISCOVER outperforms others in all 6 sub-domains: CB, AN, IM, TX, BI, and PL. To compare against freely available online services such as EJF and SJS, recommendations of DISCOVER are restricted to only the Elsevier and Springer journals (See Fig. 15 and Fig. 16). In CS, precision@k scores of DISCOVER (Elsevier) exceed that of EJF in 10 sub-domains
Figure 13: Sub-domain wise precision@k calculation (CS)
0.50 0
3
6
9
Top K papers
PVR PAVE
12 CBF RWR
15
0
3
6
FB CF CN
Figure 14: Sub-domain wise precision@k calculation (BIO)
22
9
Top K papers
lP
1485
(AI, IR, PDS, WSN, IP, MM, SC, NLP, ML, and DM) ex-1500 at position (P@3) but does better than CBF at all other cept in SE and CV sub-domains where DISCOVER could positions. FB method performs the worst compared to all not show good performances. other methods except at position 6 (P@6). Similarly DISCOVER (Springer) exceeds SJS in 10 subdomains (AI, PDS, WSN, IP, CV, MM, SC, NLP, ML, and Table 10: Overall precision (P@k) results (BIO) DM) except IR and SE (Fig. 15 and Fig. 16). Methods P@3 P@6 P@9 P@12 P@15 In the case of BIO, DISCOVER (Elsevier) exceeds EJF FB 0.527 0.541 0.551 0.549 0.561 in 5 sub-domains (IM, AN, TX, BI, and PL) other than CF 0.540 0.549 0.556 0.557 0.563 domain CB (Fig. 17). However, DISCOVER (Springer) CN 0.598 0.606 0.605 0.604 0.611 CBF 0.639 0.628 0.627 0.622 0.622 outperforms SJS in all sub-domains of BIO. Interestingly, RWR 0.652 0.641 0.620 0.612 0.611 DISCOVER (Springer) exhibits the best performance in CF+CBF 0.653 0.654 0.641 0.634 0.626 PVR 0.670 0.661 0.637 0.629 0.632 CB there. + + + + +
urn a
1480
re-
Figure 15: Sub-domain wise precision@k calculation (CS)
P@3
P@6
FB CF CN CBF RWR CF+CBF PVR PAVE DISCOVER
0.541 0.578 0.615 0.634 0.638 0.659 0.689+ 0.666 0.778*
0.581 0.574 0.625 0.618 0.620 0.642 0.666 0.671+ 0.733*
P@9
P@12
P@15
0.583 0.584 0.629 0.612 0.615 0.641 0.643 0.648+ 0.697*
0.585 0.598 0.631 0.605 0.613 0.630 0.631 0.635+ 0.685*
0.590 0.604 0.632 0.609 0.615 0.627 0.622 0.637+ 0.684*
1505
Jo
‘*’ denote statistically significant results over the second best (‘+’)
1495
0.718 0.794*
0.682 0.753*
0.656 0.727*
0.645 0.717*
0.641 0.706*
‘*’ denote statistically significant results over the second best (‘+’)
Table 9: Overall precision (P@k) results (CS) Methods
PAVE DISCOVER
1510
6.2.2. Overall results of precision@k When we compute the overall precision taking the average of precision values over 12 sub-domains of CS at a given rank (3, 6, 9, 12 and 15), DISCOVER outshines1515 other methods at all ranks (Table 9). PAVE is the secondbest performer and mostly outplays other baseline methods. As the second-best, PVR outperforms PAVE only for precision@3. Among the low-performers, CN fares badly 23
For BIO, the same exercise was done (Table 10) with similar results. DISCOVER is consistently better than all other methods at all positions, and PAVE is the secondbest performer among other baseline methods. 6.2.3. nDCG@k As explained earlier, nDCG captures the performance for graded relevance of venues. In terms of nDCG, DISCOVER is ahead of other state-of-the-art methods consistently in 9 sub-domains (CV, ML, IR, NLP, IP, WSN, AI, PDS, and SC) (Fig. 18 and Fig. 19) except SE, MM, and DM. DISCOVER shows consistent performances in terms of precision@k and nDCG@k in 9 sub-domains out of 12 sub-domains in CS. For BIO, (Fig. 20), DISCOVER defeats other state-ofthe-art methods consistently in 5 sub-domains such as CB, AN, BI, TX, and PL except for IM where it shows an average performance.
Journal Pre-proof
Computer Vision 0.9 0.8
Precision@K
0.8 0.7 0.6
0.7 0.6 0.5
0.3 0
3
6
9
Top K papers
12
15
3
6
0.6 0.5
0.9
0.4
0.8 0.7
0.3 12
3
6 9 Top K papers
15
12
15
Data Mining
1.0
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.9 0.8 0.7 0.6
0.4
0
3
6 9 Top K papers
12
15
0.3
0
3
6 9 Top K papers
re-
6 9 Top K papers
0
0.5
0.5 3
15
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.6
0
12
Precision@K
0.7
9
Top K papers
1.0
Precision@K
Precision@K
0.8
0.6
Machine Learning
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.9
0.8
0.2
0
Natural Language Processing 1.0
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.4
0.4
0.5
Security 1.0
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
Precision@K
0.9
Precision@K
Multimedia 1.0
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
pro of
1.0
12
15
12
15
12
15
Computational Biology
lP
Figure 16: Sub-domain wise precision@k calculation (CS)
Immunology
Anatomy
0.725
0.85 0.80
0.70
0.675
Precision@K
Precision@K
0.700
urn a
Precision@K
0.90
0.75
0.650 0.625 0.600
0.550
0
3
6
9
Top K papers
12
15
0
Toxicology
0.75
6
9
Top K papers
12
15
0
Jo
0.60
Precision@K
0.65
0
3
6
9
Top K papers Elsevier
12
15
9
Paleontology 0.80
0.75 0.70
0.78 0.76 0.74
0.60
0.50
6
Top K papers
0.82
0.65
0.55
3
Bio Chemistry
0.85 0.80
0.70
Precision@K
0.50 3
Precision@K
0.70
0.60 0.55
0.575
0.75
0.65
0.72 0
3
6
9
Top K papers
Springer
12
15
0
DISCOVER(Elsevier)
3
6
DISCOVER(Springer)
Figure 17: Sub-domain wise precision@k calculation (BIO)
24
9
Top K papers
Journal Pre-proof
0.80
0.75
0.75
0.70
0.70
0.65
0.9
0.65
0.60
0.60
0.55
0.55
0.50 3
6
9
Top K papers
12
15
6
9
Top K papers
0.9
0.80
0.8
0.75
0.7 0.6
0.70 0.65
0.50 6
9
12
0
15
0
6
9
Top K papers
0.9
12
15
0.7 0.6
3
6
9
Top K papers
PVR PAVE
12
0.4
15
0
3
6
9
Top K papers
CBF RWR
12
15
FB CF CN
re-
DISCOVER CF+CBF
3
Natural Language Processing
0.5
0.55
0.4
Top K papers
15
0.8
0.60
0.5
3
12
nDCG@K
0.85
nDCG@K
nDCG@K
3
Software Engineering
1.0
0
0.7
0.5
0
Information Retrieval
0.3
0.8
0.6
0.50 0
Machine Learning 1.0
nDCG@K
0.80
nDCG@K
nDCG@K
Computer Vision
0.85
pro of
Multimedia
0.85
lP
Figure 18: Sub-domain wise nDCG@k calculation (CS)
Data Mining
0.9
Image Processing
0.8
0.9
0.6 0.5
nDCG@K
nDCG@K
urn a
0.6
0.7
nDCG@K
Wireless Networks 1.0
0.8
0.4 0.2
0.4
3
6
9
Top K papers
12
15
0
Artificial Intellignece
0.9
0.5 3
6
9
Top K papers
12
15
0
3
Parallel and distributed systems
6
9
Top K papers
12
15
12
15
Security
1.0
0.85
0.9
0.80
Jo
0.7
6
9
Top K papers DISCOVER CF+CBF
0.65
0.50 12
15
0.8 0.7 0.6
0.55
0.5
3
0.70
0.60
0.6
0
0.75
nDCG@K
nDCG@K
0.8
nDCG@K
0.90
0.7 0.6
0.0
0
0.8
0.5 0
3
6
9
Top K papers
PVR PAVE
12
15
0
3
CBF RWR
Figure 19: Sub-domain wise nDCG@k calculation (CS)
25
6
9
Top K papers FB CF CN
Journal Pre-proof Computational Biology
Anatomy
0.85
0.75
Immunolgy 0.80
0.80
0.60
0.75
0.75
0.70
0.70
nDCG@K
0.65
nDCG@K
nDCG@K
0.70
0.65 0.60
0.60
0.55
0.55
0.50
0.55
0.45
0.50 0
3
6
9
Top K papers
12
15
0.50 0
3
Biochemistry
6
9
Top K papers
12
15
0
Taxonomy
0.8
3
6
9
Top K papers
12
15
Paleontology
0.85
0.75
pro of
0.80
0.70
0.75
0.60 0.55
nDCG@K
0.7
0.65
nDCG@K
nDCG@K
0.65
0.6
0.70 0.65 0.60 0.55
0.5
0.50
0.50 0
3
6
9
Top K papers
12
15
DISCOVER CF+CBF
0.4
0
3
6
9
Top K papers
PVR PAVE
12
0.45
15
0
3
CBF RWR
6
9
Top K papers
12
15
FB CF CN
1525
In terms of nDCG@k, both DISCOVER (Elsevier) and DISCOVER (Springer) do much better their counterparts, namely EJF and SJS in 10 sub-domains (MM, SC, ML, IP, DM, IR, WSN, SE, PDS, and AI) other than SC and CV in the CS domain (Fig. 21 and Fig. 22). For BIO, DISCOVER (Elsevier) excels over EJF in all 6 sub-domains while DISCOVER (Springer) outranks SJS in 4 sub-domains (CB, BI, IM and AN) except TX and PL (Fig. 23).
Table 12: Overall nDCG@k results (BIO)
lP
1520
re-
Figure 20: Sub-domain wise nDCG@k calculation (BIO)
Methods
nDCG@3
nDCG@6
nDCG@9
nDCG@12 nDCG@15
FB CF CN CBF RWR CF+CBF PVR PAVE DISCOVER
0.538 0.541 0.575 0.656 0.648 0.664 0.631 0.685+ 0.760*
0.544 0.558 0.589 0.658 0.642 0.661 0.631 0.686+ 0.742*
0.543 0.571 0.617 0.666 0.654 0.675 0.642 0.678+ 0.752*
0.579 0.601 0.628 0.657 0.637 0.669+ 0.657 0.660 0.743*
0.578 0.609 0.626 0.642 0.627 0.643 0.653+ 0.648 0.740*
‘*’ denote statistically significant results over the second best (‘+’)
Table 11: Overall nDCG@k results (CS) nDCG@3
nDCG@6
FB CF CN CBF RWR PVR CF+CBF PAVE DISCOVER
0.541 0.583 0.616 0.643 0.644 0.687 0.688+ 0.672 0.783*
0.586 0.588 0.624 0.640 0.646 0.686 0.691+ 0.688 0.765*
nDCG@9
nDCG@12 nDCG@15
urn a
Methods
0.622 0.625 0.647 0.664 0.674 0.702 0.709 0.714+ 0.770*
0.687 0.693 0.712 0.717 0.735 0.753 0.730 0.771+ 0.802*
Table 13: Diversity (D) of DISCOVER and other approaches
0.772 0.784 0.802 0.804 0.812 0.828 0.779 0.840+ 0.872*
‘*’ denote statistically significant results over the second best (‘+’)
1535
1540
D (CS)
D (BIO)
FB CF CN CBF RWR CF+CBF PVR PAVE DISCOVER
0.227 0.387 0.281 0.219 0.312 0.394 0.403+ 0.327 0.519*
0.241 0.355 0.274 0.206 0.322 0.369 0.397+ 0.319 0.503*
‘*’ denote statistically significant results over the second best (‘+’)
6.2.4. Overall results of nDCG@k Average nDCG@k over 12 sub-domains of CS are depicted in Table 11. DISCOVER exceeds all other baseline methods with PAVE being the second-best except at position 3 and 6 (nDCG@3 and nDCG@6). CF+CBF performs better than PAVE only at positions 3 and 6. Afterward its shows a lower nDCG than both PVR and PAVE methods. Among the rest, RWR performs better than1545 CBF, CN, CF, and FB with FB being the worst performer. In BIO, there is a clean sweep for DISCOVER in terms of nDCG irrespective of positions (Table 12). PAVE exhibits the second-highest performance.
Jo
1530
Methods
26
6.3. Evaluation of diversity Diversity is defined in terms of content dissimilarity. We group all papers published at a particular venue and extract their corresponding keywords. We apply the similarity score in Eqn. 26 in the definition of diversity (Table 13). DISCOVER is seen to show the best diversity, and its performance gap with the second-best (PVR) is statistically significant (at 5% level) for both CS and BIO as shown in Table 13.
Journal Pre-proof
Information Retrieval 0.95 0.90 0.85 0.80 0.75
0.7 0.6
0.60 0
3
6
9
Top K papers
12
15
0
Parallel and distributed systems
6
9
Top K papers
0.8
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier 3
6
9
Top K papers
12
3
0.7
6
9
Top K papers
12
15
Computer Vision
1.00
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.95 0.90 0.85 0.80
0.5
0.75
15
0
3
6
9
Top K papers
12
15
0
3
re-
0
0
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.6
0.4
15
nDCG@K
0.7
nDCG@K
0.9
0.5
12
0.75
Artificial Intellignece
0.8
0.6
3
0.85 0.80
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.5
0.90
pro of
0.65
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.95
0.8
0.70
nDCG@K
Software Engineering 1.00
0.9
nDCG@K
nDCG@K
Wireless Networks
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
nDCG@K
1.00
6
9
Top K papers
12
15
lP
Figure 21: Sub-domain wise nDCG@k calculation (CS)
Multimedia
Security
1.0
1.0
nDCG@K
0.7
0.7
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.6 0.5 0
3
6 9 Top K papers
12
0.8
0
3
0.75
6 9 Top K papers
0.65
0
3
6 9 Top K papers
12
15
Data Mining 1.0
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.8
0.9
0.7
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.8 0.7 0.6
0.5
0.65
3
15
0.6
0.70
0
12
nDCG@K
0.80
0.80
Image Processing
0.9
nDCG@K
0.85
Jo
nDCG@K
0.90
6 9 Top K papers
1.0
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.95
0.85
0.70
Machine Learning
1.00
0.90
0.75
0.6
15
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
0.95
nDCG@K
0.9
0.8
urn a
nDCG@K
0.9
Natural Language Processing 1.00
DISCOVER(Springer) DISCOVER(Elsevier) Springer Elsevier
12
15
0
3
6 9 Top K papers
12
15
0
3
Figure 22: Sub-domain wise nDCG@k calculation (CS)
27
6 9 Top K papers
12
15
Journal Pre-proof Computational Biology
Bio Chemistry 0.9
0.75
0.75 0.70
0.70
nDCG@K
0.80
nDCG@K
0.65 0.60
0.65 0
3
6
9
12
Top K papers
15
0
3
6
12
0.55
0.55 9
12
15
9
12
Top K papers
15
Toxicology
nDCG@K
nDCG@K
0.70
0.60
Top K papers
6
0.850
0.75
0.60
6
3
0.875
0.65
3
0
0.900
0.80
0.75
0
15
Anatomy
0.80
nDCG@K
9
Top K papers
0.85
0.65
0.7 0.6
Immunology
0.70
0.8
pro of
nDCG@K
0.85
0.60
Paleontology
0.80
0.90
0.825 0.800 0.775 0.750 0.725
0
Elsevier
3
6
9
12
Top K papers
Springer
15
0
3
DISCOVER(Elsevier)
6
9
12
Top K papers
15
DISCOVER(Springer)
Table 14: Stability (MAS) of DISCOVER and other approaches Methods
MAS (CS)
MAS (BIO)
FB CF CN CBF RWR PVR CF+CBF PAVE DISCOVER
9.961 8.936 9.784 5.887 8.992 8.236 5.639 8.863 4.758*
9.864 8.873 9.862 6.045 9.137 8.179 5.582 8.761 4.695*
80 75 70 65 60
2
4
6
8
10
12
Top N Recommendation
PAVE DISCOVER CF+CBF
‘*’ denote statistically significant results over the second best (‘+’)
CBF RWR PVR
14 FB CF CN
Figure 24: Average venue quality (CS)
Jo
1560
H5-Index
lP
1555
6.3.1. Evaluation of stability BIO dataset in detail (Fig. 12, Fig. 13, Fig. 14, Fig. 15, We have also provided a comprehensive investigation of1570 Fig. 16,Fig. 17, Fig. 18, Fig. 19, Fig. 20, Fig. 21, Fig. 22, Fig. 23). the stability of the proposed DISCOVER, as defined in Eqn. 27. DISCOVER shows the minimum MAS than all other standard approaches (Table 14). It shows a MAS of 4.758 for CS, meaning that on an average, every predicted venue will shift by a position of 4.758 after adding new 95 data into the training data of the system. Similarly, it 90 shows a MAS of and 4.695 for BIO. We have considered the average MAS-score as a threshold to decide whether a 85 particular method provides stability or not.
urn a
1550
re-
Figure 23: Sub-domain wise nDCG@k calculation (BIO)
6.4. Study of the proposed approach
Here we revisit the subqueries (SQs) pertaining to the broad RQs that we started with along with our observations. 1565
6.4.1. SQ1: How effective is DISCOVER in comparison1575 to other state-of-the-art methods? We see the performance of DISCOVER vis-a-vis the other methods for each of the subdomains of CS and 28
Also we measure the overall performance of DISCOVER over all sub-domains taken together. The overall results of precision@k, nDCG@k, accuracy and MRR as shown in Table 5, Table 6, Table 9, Table 10, Table 11, and Table 12 are statistically significant (paired-samples t-test at α=0.05) over other approaches in both the domains of CS and BIO.
Journal Pre-proof 110 100
100
95 90
H5-Index
80
75
60
70 65
2
4
6
8
10
12
Top N Recommendation
PAVE DISCOVER CF+CBF
CBF RWR PVR
14 FB CF CN
6
8
10
12
Top N Recommendation
14
Springer DISCOVER(Springer)
quite high, compared to that from other methods. In the case of BIO, it is even higher, possibly due to less number of journals and a lesser amount of overlap in their scope of publications. DISCOVER (Elsevier) recommending similar venues with EJF except for few positions (1st and 7th) recommendation in the domain of CS where DISCOVER is better (Fig. 26). The average H5-index of DISCOVER (Elsevier) is 75 whereas the average H5-index of EJF is 71. Similarly, the H5-index recommended by DISCOVER (Springer) are of higher quality than that by SJS. The average H5-index of DISCOVER (Springer) is 43 whereas the average H5index of SJS is 39. Similarly DISCOVER (Elsevier) performs better than EJF with an average H5-index of 94 in the domain of BIO (Fig. 27). The average H5-index of EJF is 81. Similarly DISCOVER (Springer) shows higher venue quality with an average H5-index of 78 where as SJS shows an average H5-index of 71.
lP
re-
H5-Index
70 60 50 40 30 2
4
urn a
80
6
8
10
12
Top N Recommendation
1615
1620
14
SPRINGER DISCOVER(SPRINGER)
Jo
ELSEVIER DISCOVER(ELSEVIER)
1625
Figure 26: Average venue quality of EJF and SJS(CS)
1595
4
Figure 27: Average venue quality of EJF and SJS (BIO)
6.4.2. SQ2: How is the quality of venues recommended by1600 DISCOVER? The venues recommended by DISCOVER are of high quality as compared to other state of the art methods. including EJF and SJS recommendations (Fig. 24, Fig. 25, Fig. 26 and Fig. 27). 1605 The average H5-index of DISCOVER shows the highest average value of 97 while recommending first venues, 86 while recommending the 3rd venue, then slightly downgrades and ends with a H5-index of 70 at position 15 in domain CS as depicted in Fig. 24. 1610 90
1590
2
Elsevier DISCOVER(Elsevier)
Figure 25: Average venue quality (BIO)
1585
80
70
50
1580
85
pro of
H5-Index
90
In BIO, DISCOVER shows the highest average H5-index of 108 while recommending the 7th venue as shown in Fig. 25. The least H5-index of 83 is exhibited by DISCOVER at position 15. PAVE is the second-best per-1630 former in H5-index in both CS and BIO. Model FB recommends the worst quality of venues than other methods in both CS and BIO. In both domains, quality of recommendations of venues by DISCOVER, as given by H5-index is 29
6.4.3. SQ3: How does DISCOVER handle cold-start issues for new researchers and new venues and other issues? (i) Cold-start issues: We take the average MRR of both CS and BIO in all 6 categories mentioned in Sec. 5.4. The analysis in Table 15 shows that, even if the seed paper related to a new venue and new researcher, DISCOVER could predict the original venue at early ranks. It does not require past publication records or co-authorship networks for the recommendations. It considers only the current area of interest along with the title, keywords, and abstract as inputs to recommend the same. (ii) Data sparsity: To specifically address data sparsity issue, both significance and relevance parameters are taken into consideration at the early stage of the proposed model. Social network analysis through various centrality measures and content features of a pa-
Table 15: MRR results of the proposed DISCOVER and other approaches

Approach     MRR
             2<=vc<8   8<=vc<15   15<=vc    2<=pc<8   8<=pc<15   15<=pc
FB           0.017     0.025      0.029     0.019     0.023      0.027
CF           0.024     0.026      0.030     0.022     0.027      0.029
CN           0.026     0.029      0.035     0.023     0.032      0.030
CBF          0.032     0.039      0.038     0.027     0.037      0.038
CF+CBF       0.038     0.049      0.056     0.044     0.043      0.046
RWR          0.039     0.046      0.048     0.028     0.038      0.041
PVR          0.042     0.051      0.056     0.035     0.042      0.044
PAVE         0.096+    0.108+     0.115+    0.096+    0.104+     0.109+
DISCOVER     0.147*    0.176*     0.180*    0.164*    0.169*     0.171*
'*' denotes statistically significant results over the second best ('+')
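The following is a minimal sketch (an assumed protocol, not the authors' code) of the mean reciprocal rank computation reported in Table 15; the variable names are illustrative.

```python
# Illustrative sketch: mean reciprocal rank (MRR) of each test paper's
# original venue within its ranked recommendation list.
from typing import List

def mean_reciprocal_rank(ranked_venues: List[List[str]],
                         true_venues: List[str]) -> float:
    rr_sum = 0.0
    for recs, truth in zip(ranked_venues, true_venues):
        if truth in recs:
            rr_sum += 1.0 / (recs.index(truth) + 1)  # ranks are 1-based
    return rr_sum / len(true_venues) if true_venues else 0.0
```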
(ii) Data sparsity: To specifically address the data sparsity issue, both significance and relevance parameters are taken into consideration at an early stage of the proposed model. Social network analysis through various centrality measures, and content features of a paper such as the abstract, title, and keywords, are exploited to capture the strength of significance and relevance, respectively. The average number of papers found after keyword matching is in the range of 5k-13k for CS and 4k-11.5k for BIO. Table 16 displays the step-wise filtration of papers for both CS and BIO (a schematic sketch of this pipeline is given after the table). After the initial filtering, we are left with meaningful papers for further computation that are close to the area of interest. Hence there is no data sparsity issue in our proposed approach, as summarized in Table 17.

Table 16: Step-wise filtration of papers in both CS and BIO

Steps                             No. of papers (CS)    No. of papers (BIO)
Original dataset                  15,641,658            14,785,486
Data preprocessing (training)     10,424,960            9,961,893
Keyword-based search              5k-13k                4k-11.5k
Centrality measure calculation    2k-5k                 1.5k-4k
Co-citation score computation     800-2k                500-1.5k
Main path analysis                45-85                 55-95
Abstract matching                 60-100                70-110
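The sketch below is purely schematic: each stage function is a hypothetical placeholder for the corresponding filtering step of Table 16, and the comments mirror the CS column of that table.

```python
# Schematic sketch of the step-wise candidate-paper filtration (Table 16).
# The stage functions are hypothetical placeholders, not the authors' API.
def filter_candidate_papers(seed_paper, corpus, keyword_search, centrality_filter,
                            cocitation_filter, main_path_filter, abstract_matcher):
    papers = keyword_search(seed_paper, corpus)      # keyword-based search: ~5k-13k
    papers = centrality_filter(papers)               # centrality measures:  ~2k-5k
    papers = cocitation_filter(seed_paper, papers)   # co-citation scores:   ~800-2k
    papers = main_path_filter(papers)                # main path analysis:   ~45-85
    return abstract_matcher(seed_paper, papers)      # abstract matching:    ~60-100
```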
(iii) Computational costs: In DISCOVER, the reduction of computational costs has been prioritized. It applies main path analysis to extract only relevant and conceptually related papers. This step is performed after the computation of centrality measures and co-citation scores, by which point a large number of papers have already been filtered out of the citation network. Hence the computational overhead does not increase substantially with the number of papers. As shown in Table 16, there is more than a 90% reduction after the initial keyword-based search step, and there is a substantial reduction in the subsequent steps as well. Note that the average number of papers remaining after main path analysis is around 45-85 in CS and 55-95 in BIO, on which the abstract similarity-based contextual feature matching is performed. Table 16 shows the average number of papers involved in each step of the proposed approach. We believe that the proposed system will show satisfactory performance on a larger dataset as well: even a substantial increase in dataset size will not greatly affect the overall computation time, so the system does not suffer from scalability issues.

(iv) Diversity: Several steps are taken to ensure diversity in the result set. First, a hybrid binary tree architecture with a keyword-based search strategy is used irrespective of publishers. We use both link- and content-similarity-based techniques at a number of places. Also, main path analysis, which traces the most significant paths in a citation network, captures papers conceptually related to a given seed paper. The integration of these approaches can provide recommendations from diverse publishers, as evidenced by Table 13: DISCOVER shows the highest value of D (diversity) among all the compared approaches.

(v) Stability: We build a content-aware recommender system based on title, keyword, and abstract similarities. During the initial stages, centrality measures are calculated, and thereafter the textual content similarity is computed to find the related papers. Ranking of venues is done at a very late stage, from a collection of related papers filtered out through a long pipeline. The addition of new papers therefore does not affect the order of recommendations. Taken together, this battery of techniques provides stability to the recommendations: DISCOVER shows the minimum MAS among all the standard approaches (Table 14). Minimal sketches of the diversity and stability measures are given below.
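The following sketches use assumed, simplified definitions (not necessarily the paper's exact formulas): a publisher-diversity score for a recommendation list, and a rank-shift style stability score between lists produced before and after adding new papers.

```python
# Illustrative sketches of diversity and stability measures (assumed definitions).
from typing import Dict, List

def publisher_diversity(recommended_venues: List[str],
                        publisher_of: Dict[str, str]) -> float:
    """Fraction of distinct publishers among the recommended venues."""
    if not recommended_venues:
        return 0.0
    publishers = {publisher_of[v] for v in recommended_venues if v in publisher_of}
    return len(publishers) / len(recommended_venues)

def mean_absolute_shift(before: List[str], after: List[str]) -> float:
    """Average absolute change in rank for venues appearing in both lists."""
    shifts = [abs(before.index(v) - after.index(v)) for v in before if v in after]
    return sum(shifts) / len(shifts) if shifts else 0.0
```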
The overall comparison of the various issues, including cold-start, is presented in Table 17.
Table 17: Issues involved in DISCOVER and the other compared approaches

Methods      Cold-start                    Sparsity   Diversity   Stability
FB           yes (new researcher)          yes        yes         no
CF           yes (researcher and venue)    yes        no          yes
CN           yes (new venue)               yes        yes         no
CBF          yes (new venue)               no         yes         no
RWR          yes (new researcher)          yes        yes         no
CF+CBF       yes (researcher and venue)    no         no          yes
PVR          yes (researcher and venue)    yes        no          yes
PAVE         yes (new researcher)          yes        yes         no
EJF          yes (new venue)               no         yes         no
SJS          yes (new venue)               no         yes         no
DISCOVER     no                            no         no          no
6.5. Some insights

The overall good scores discussed in Sec. 6.2.2 and Sec. 6.1 showcase the efficacy of the proposed DISCOVER for venue recommendation. However, there are a few limitations, as follows.

(i) The proposed system may not recommend relevant venues when fewer than 3 to 5 domain-specific keywords are provided. As a result, DISCOVER displays only average performance in terms of precision@k in a few sub-domains, such as SE, MM, and DM, as depicted in Fig. 12 and Fig. 13.

(ii) If there is an insufficient number of related papers, the proposed system may fail to capture the relevant papers, resulting in possibly irrelevant venue recommendations. Due to these constraints, DISCOVER exhibits the worst precision against EJF in the sub-domains ML and DM, and against SJS in the sub-domains IR and SE, as depicted in Fig. 12 and Fig. 13.

(iii) The proposed approach displays the worst nDCG against EJF, as depicted in Fig. 22. The minimum number of related papers of a specific sub-domain required for a good recommendation is found to be in the range of 1k-2k.

(iv) During content similarity computation, we extract keywords from the papers only. If other researchers do not use these keywords, this can cause a problem in related-paper extraction at the initial stage of the proposed model, and the system may fail to recommend relevant venues.

(v) The proposed system hence exhibits the worst nDCG against both EJF and SJS in the sub-domain CV, as depicted in Fig. 22. The system can recommend relevant venues only if the citation network is strongly connected.
7. Conclusion and future research
Academic venue recommendation is an emerging area of research in recommender systems. The proposed techniques are few in number, and they suffer from several problems. One of the major issues is cold-start, which has two sub-parts: that for new venues and that for new researchers. There also exist issues of sparsity, diversity, and stability that are hitherto not adequately addressed by the existing state-of-the-art methods.

This paper proposes a diversified yet integrated social network analysis and contextual similarity-based scholarly venue recommender (DISCOVER) system that reasonably addresses all the above-mentioned issues. It is developed taking into account recent advances in social network analysis, incorporating centrality measure calculation, citation and co-citation analysis, topic modeling based contextual similarity, and main path analysis of a bibliographic citation network. To assist in identifying relevant research outlets, contextual similarity is computed through a hybrid approach combining topic modeling and matrix factorization techniques. We conducted an extensive set of experiments on a real-world dataset, MAG, and demonstrated that DISCOVER consistently outperforms the state-of-the-art methods and other freely available online services such as EJF and SJS. On two different domains (CS and BIO), DISCOVER shows significantly better scores of precision@k, nDCG@k, accuracy, MRR, F-measure (macro), diversity, and stability than the other state-of-the-art methods. DISCOVER also suggests higher-quality venues, in terms of H5-index, than the state-of-the-art methods and the freely available online services EJF and SJS.

Nonetheless, there is scope for future study in this direction. We plan to experiment with other datasets and to extend the system to multiple disciplines, with the goal of improving accuracy, diversity, novelty, coverage, and serendipity. We would also like to investigate the same problem with the help of a heterogeneous bibliographic information network with meta-path features.

8. Compliance with ethical standards

The authors declare no conflict of interest. The article uses social network analysis and topic modeling to recommend publication venues for a new paper. The article does not contain any studies with human or animal subjects.
Tribikram Pradhan received his M.Tech in Software Technology from VIT University, Vellore, Tamil Nadu, India in 2013. He is currently pursuing a PhD in the Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India. Prior to this, he worked as an assistant professor in the Department of Information and Communication Technology (ICT), Manipal Institute of Technology, Manipal, India from 2013 to 2016. His primary research interests include the areas of Recommender Systems, Social Network Analysis, NLP, and Machine Learning.
Sukomal Pal received his M.Tech and Ph.D. from the Department of Computer Science and Engineering, Indian Statistical Institute, Kolkata in 2005 and 2012, respectively. He joined the Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India as an assistant professor in 2016. Prior to this, he worked as an assistant professor in the Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad from 2010 to 2015. He authored a book entitled "Sub-document level Information Retrieval: Retrieval and Evaluation" (LAP LAMBERT Academic Publishing). His research interests include the areas of Information Retrieval, Recommender Systems, Text Mining, and Data Science.
Highlights

We propose DISCOVER: a scholarly venue recommendation system.

Our work provides an integrated framework of social network analysis and content-based features.

It addresses cold start, data sparsity, diversity, and stability issues.

Experiments on the Microsoft Academic Graph (MAG) dataset show that DISCOVER outperforms state-of-the-art methods.
Conflict of Interest and Authorship Conformation Form

Please check the following as appropriate:

o All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version.

o This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.

o The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

o The following authors have affiliations with organizations with direct or indirect financial interest in the subject matter discussed in the manuscript:

Author's name          Affiliation
Tribikram Pradhan      Indian Institute of Technology Banaras Hindu University
Dr. Sukomal Pal        Indian Institute of Technology Banaras Hindu University