Expert Systems With Applications 136 (2019) 242–251
Expert knowledge recommendation systems based on conceptual similarity and space mapping

Li Gao a,*, Kun Dai b, Liping Gao b, Tao Jin b

a University of Shanghai for Science and Technology, Library Department, 516 Jungong Rd, Shanghai, China
b University of Shanghai for Science and Technology, School of Optical-Electrical and Computer Engineering, 516 Jungong Rd, Shanghai, China

* Corresponding author. E-mail addresses: [email protected] (L. Gao), [email protected] (K. Dai), [email protected] (L. Gao), [email protected] (T. Jin).

https://doi.org/10.1016/j.eswa.2019.06.013
Article history: Received 30 March 2019; Revised 21 May 2019; Accepted 6 June 2019; Available online 14 June 2019

Keywords: Conceptual similarity; Space mapping; Core resource database (CRD); Institutional repository (IR); Expert Knowledge Recommendation System (EKRS)
Abstract

The semantic analysis of structured big data generated from human knowledge is important in expert recommendation systems and in scientific and technological information analysis. In these fields, the central problem is the calculation of concept similarity. This study explores the spatial mapping relationship between a general knowledge base and a professional knowledge base so that a general knowledge map can be applied in professional fields. With the core resource database (CRD) as the main body of general knowledge and the institutional repository (IR) as the main body of professional knowledge, the conceptual features of institutional expert knowledge were first abstracted from the IR and inferred from small-scale datasets, and a mathematical model was established based on the similarity of text concepts and the related ranking results. Then, a two-set concept space mapping algorithm between CRD and IR was designed. In the algorithm, finer-grained concept nodes were extracted from the information on the shortest paths among concepts to obtain a new knowledge set, the Expert Knowledge Recommendation System (EKRS). Finally, a simulation experiment was carried out with open datasets to verify the algorithm. The simulation results showed that the algorithm reduced the structural complexity in the calculation of large datasets. The proposed system model had a clear knowledge structure, and the recommendation accuracy based on text similarity was high. For small-scale knowledge base datasets with different sparsity, the system showed stable performance, indicating the good convergence and robustness of the algorithm.

© 2019 Published by Elsevier Ltd.
1. Introduction

Human thinking reasons about, analyzes, and judges a problem on the basis of one's own experience and knowledge. Simulating human thinking modes through semantic similarity evaluation remains a long-standing challenge in the field of artificial intelligence. A fundamental way to address this problem is to endow the computer with a knowledge structure model similar to that of human beings by using a knowledge base. The institutional repository (IR) is a knowledge platform on which universities can realize the long-term preservation of their academic achievements, promote academic dissemination, and share the research results of institutions of higher learning. After years of exploration, built on modern information technologies and with enriched and refined functions, the IR has become a new knowledge base platform for online services in the field of library
and information science in recent years (Jie, Jiayong, Ling & Ruifan, 2015; Ma, 2017).

In the whole framework of human knowledge, the concept is the most basic knowledge unit, and the similarity calculation of various concepts is an important means of constructing knowledge recommendation systems. However, most existing frameworks are aimed at a single independent knowledge base network and do not take into account the potential associations with other similar knowledge networks (Cao, Zhang & Yong, 2018; Wang, Gibbins, Payne & Patelli, 2013). Because of the coarse granularity of knowledge in an individual knowledge base and the sparseness of its content, it is difficult to reflect the full picture of a knowledge point. To explore the relationship between two conceptual entries, it is necessary to acquire substantial background knowledge in related fields and to establish a concept space mapping relationship with the core resource datasets through network links. After semantic similarity analysis, the integration of knowledge can help overcome the obstacle in simulating human thinking, which relies on knowledge and background experience for reasoning and association. Therefore, we propose a new algorithm based on concept similarity and space mapping, called the Two-set Concept Space Mapping
Algorithm (Two-set refers to the IR dataset and the CRD dataset). In this algorithm, general knowledge is extracted from core resource databases (CRD, e.g. Elsevier, WOS, and CNKI), and professional knowledge is extracted from the institutional repository (IR). Taking full advantage of the information on the shortest paths among concepts, the algorithm extracts concept nodes with finer granularity. In this way, we construct the Expert Knowledge Recommendation System (EKRS) and improve the accuracy and coverage of purely computation-related concepts. It is a dynamic and timely system whose knowledge associations are constantly updated; it can track the characteristics of scholars' professional interests and provide a valuable basis for researchers and learners.

The main contributions of this paper are summarized as follows. Firstly, this paper constructs a mathematical model, designs the algorithm through a two-set concept space mapping process, and attempts to carry out mathematical modeling of language. Secondly, this study develops a rational mapping between the professional knowledge base and the general knowledge base in order to solve the application problem of the general knowledge map in professional fields. Thirdly, the expert knowledge recommendation system (EKRS) proposed in this paper is an active recommendation system for knowledge discovery and knowledge services.

The rest of the paper is organized as follows. Section 2 presents a brief review of related work. Section 3 provides the problem description of the two-set concept space mapping algorithm and the mathematical model. The algorithm design is presented in Section 4. In Section 5, the simulation experiment of the algorithm is carried out with open datasets. Conclusions are presented in the final section.
2. Related studies

Current mainstream recommendation algorithms include content-based recommendation (De Angelis & Dias, 2014; Dias, José & Ramos, 2014), collaborative filtering recommendation (Pham, Vuong, Thai, Tran & Ha, 2016), rule-based recommendation (Xiao-Wen, Ming, Ji-Tao & Chang-Sheng, 2016), and so on. Most of these recommendation algorithms are mainly applied in social networks and rarely in knowledge networks. In this paper, we attempt to extend conceptual similarity computation and link-based correlation algorithms to knowledge recommendation systems.

Among the knowledge bases compiled by humans, Encyclopedia Britannica (EB) and Wikipedia are typical examples. Conceptual similarity algorithms based on large-scale knowledge bases include content-based algorithms and link-based algorithms. The display language analysis method collects concepts through Wikipedia and establishes a matrix relation between the contents of a text and the collected concepts, but it lacks the use of the human knowledge system (De Angelis & Dias, 2014). The vector of a concept is used to measure the leapfrogging relationship between two concepts through the analysis of the presentation language (Junhua, Wanli & Zhao, 2015). With Wikipedia entries as ontologies, some scholars combined keywords with topics and texts in three ways, but they did not consider the path relationship between entries (Fiorelli, Pazienza, Stellato & Turbati, 2014). The concept and classification diagram is an automatic method of subject classification, but it cannot solve the problems of a limited number of classes and of a concept belonging to multiple classes (Tan, Guan & Cai, 2014). The method of semantic correlation evaluation based on text content establishes a semantic converter to build the weight vector of a concept and evaluates the semantic correlation by comparing the vectors; in the relationship between concepts, only the relationship between the
categories is considered, while the path formed by hyperlinks is ignored (Al-Hassan, Lu & Lu, 2015).

Concept similarity algorithms are based on concept vectors, paths, probability, and links. The classical vector-based methods, such as accurate semantic analysis, judge correlation by comparing the weighted vectors of important concepts (Kushwaha & Vyas, 2014; Yanes, Ben Sassi & Hajjami Ben Ghézala, 2015). The concept vector judges the correlation of two concepts by comparing the vectors projected onto the corresponding classes (Calvo, Méndez & Moreno-Armendáriz, 2016). A path-based measure projects concepts onto the corresponding classes and compares the similarity of concepts through the shortest path between them (Kaliraj & Bharathi, 2018; Liu, Zhang & Hu, 2019). The probability-based algorithms judge relevance according to the probability distribution of simulated human clicks, such as random walks (Masuda, Porter & Lambiotte, 2016) and improved random walks (Diel & Lerasle, 2017). The link-based algorithms rely on the links in the concept network, such as the standardized link distance (France, Carroll & Xiong, 2012) and link vector similarity (Elliott, Siu & Fung, 2014).

Although various algorithms for calculating conceptual correlation are available (Elliott et al., 2014; France et al., 2012; Junhua, Wanli & Zhao, 2015; Kushwaha & Vyas, 2014; Yanes et al., 2015), most of them are only applicable to concepts of medium and long distance, namely, concepts with significant differences. However, in our experiment, the compared concepts were extracted from the institutional repository (IR) and most of them belonged to the same category, so the above-mentioned algorithms did not work well. Besides, the experiments of these algorithms were based on large-scale knowledge bases, and the number of conceptual nodes in the training sets was between 1 million and 15 million. Therefore, a great quantity of data labeling and distributed training on multiple machines was required. The training models were complex and the range of model representation was limited (Lin, Libo, Tiejian & Qiyang, 2017; Libo, Yihan & Tiejian, 2017).

3. Problem descriptions and mathematical model

3.1. Problem descriptions of the two-set concept space mapping algorithm

The general knowledge set obtained from the core resource database (CRD) and the professional knowledge set obtained from the institutional repository (IR) are linked through the network, and then a new knowledge concept set, the expert knowledge recommendation system (EKRS), is constructed by concept space mapping. This paper proposes a two-set conceptual space mapping model based on the network link between CRD and IR, as shown in Fig. 1. In order to describe the process of building the EKRS based on the two-set concept space mapping algorithm effectively, we establish the following definitions.

Definition 1. The academic achievement text set T is obtained from the institutional repository (IR) and the concept words ω are extracted from it. Sim(T1,T2) is calculated with the concept similarity, and the ordered concept dataset V is formed after sorting by the shortest path between concepts.

Definition 2. The general knowledge set is obtained from CRD. Its text set T' and concept words ω are sorted after Sim(T1,T2) is calculated with the concept similarity, and the ordered concept set H is formed after sorting.

Definition 3. A new concept mapping set G (EKRS) can be formed by linking the professional knowledge set V with the general knowledge set H, i.e., |V∩H|.
Fig. 1. Mapping process of two-set conceptual space.

Based on the definitions of the system operation process, we make the following assumptions.

Assumption 1. In the original dataset obtained from IR, the integrity and accuracy of the institutional data should be maintained.

Assumption 2. In the original XML files, the entries and the concept words corresponding to hyperlinks in the text (called nodes) are extracted.

Assumption 3. The extracted concept words in the texts are matched with the thesaurus, and the number of concept nodes is far larger than the number of vocabulary entries.

Assumption 4. The identical concept words of the dataset are deleted, and some non-content pages (such as categorized pages, help pages, and blank pages) are merged with the corresponding pointed pages.

Assumption 5. Concept word nodes are arranged in descending order of their weights in the text through semantic analysis.

Assumption 6. In this study, the semantic similarity among small-scale short texts in the professional field is calculated.

Assumption 7. Every vocabulary entry in IR is named in a unified and formal way and is linked with the CRD through the network, so the nodes from the two concept datasets can be matched.

The main notations of this work are summarized in Table 1 for clarity.

3.2. Mathematical model

The corresponding relationships among the definitions, assumptions, equations, constraints and objective functions are summarized in Table 2 for clarity. From Definition 1 and Assumption 1, a complete and accurate dataset of the academic achievements of experts and scholars can be obtained from IR, and the texts can be extracted as T meeting Eq. (1):

T = {Text1, Text2, ..., Textn},    (1)

From Assumption 2, with the notional words in the extracted texts taken as concept words ω, any two texts of an expert's work in recent years are selected as T1 and T2, meeting Eqs. (2)-(3):

T1 = {ω1, ω2, ..., ωmi} = Σ_{m=1,i=1}^{m,i} ω,    (2)

T2 = {ω1, ω2, ..., ωnj} = Σ_{n=1,j=1}^{n,j} ω,    (3)

where m denotes the number of concept words in T1, i denotes the occurrence times of ω in T1, n denotes the number of concept words in T2, and j denotes the occurrence times of ω in T2. From Assumption 3, the concept words ω in T1 and T2 are matched with the words in the thesaurus to obtain the word vector relations meeting Eqs. (4)-(7):

Ami = {ω1, ω2, ..., ωmi},    (4)

Bnj = {ω1, ω2, ..., ωnj},    (5)

Ami = [a11 ... a1i; ...; am1 ... ami],    (6)

Bnj = [b11 ... b1j; ...; bn1 ... bnj],    (7)
where the numbers of vocabulary entries ai and bj are much larger than that of ωi. From Assumption 4, assume that the two text vector sets have k duplicated concept words, as in Eq. (8):

Ami ∩ Bnj = k.    (8)
The intersection of Ami and Bnj has the repeated elements aki and bkj, where the subscripts of the identical concept words are set as k. After removing the identical concept words, we get Ami and Bnj meeting Eqs. (9) and (10) and obtain the sub-matrix (m−k)×(n−k):

Ami = {a1, a2, ..., a(m−k)i},    (9)

Bnj = {b1, b2, ..., b(n−k)j}.    (10)
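For illustration only, the toy Python sketch below mirrors Eqs. (8)-(10): it finds the k concept words shared by two texts and removes them, and then orders the remaining words by descending weight in anticipation of Assumption 5 and the constraints that follow. The word lists, weights, and variable names are assumptions made for the sketch, not data or code from the paper.

# A toy sketch of Eqs. (8)-(10): remove the k concept words shared by the two
# texts, then order the remaining words by descending weight (Assumption 5).
A = ["similarity", "ontology", "shortest path", "wikipedia"]
B = ["similarity", "knowledge graph", "wikipedia", "word embedding"]
weights = {"ontology": 0.9, "knowledge graph": 0.8, "shortest path": 0.7,
           "word embedding": 0.6, "similarity": 0.5, "wikipedia": 0.4}

k = set(A) & set(B)  # Eq. (8): the duplicated concept words
A_reduced = sorted((w for w in A if w not in k), key=weights.get, reverse=True)  # Eq. (9)
B_reduced = sorted((w for w in B if w not in k), key=weights.get, reverse=True)  # Eq. (10)
print(A_reduced, B_reduced)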
From Assumption 5, the sub-matrix (m−k)×(n−k) is rearranged according to the weight coefficients as constraints meeting the inequalities (11)-(12):

∀ (m−k) ≤ r ≤ s ≤ m: tr ≥ ts,    (11)

∀ (n−k) ≤ r ≤ s ≤ n: t'r ≥ t's,    (12)
Table 1
Notations used in the paper.

Notation: Description
T, T1, T2: the texts obtained from IR or CRD
ωmi, ωnj: m, n denote the numbers of concept words ω in T; i, j denote the occurrence times
Ami, Bnj: concept matrices
tr, ts: the weight coefficients of ωr and ωs in T
Ã, B̃: short text sets
Sim(T1,T2): similarity between T1 and T2
Jc: the Jaccard similarity coefficient
V: the IR concept mapping set
H: the CRD concept mapping set
G: the EKRS database set
Table 2
Corresponding table of concepts and formulas.

Definition 1: Assumptions 1-3; Equations (1), (2)-(3), (4)-(7); Objective function (17)
Definition 2: Assumptions 4-6; Equations (8), (9)-(10), (13)-(14), (18)-(19); Constraints (11)-(12), (15)-(16)
Definition 3: Assumption 7; Equations (20)-(21); Constraint (22); Objective function (23)
where tr and ts respectively denote the weight coefficients of ωr and ωs in Text1, and t'r and t's respectively denote the weight coefficients of ω'r and ω's in Text2. In other words, the rows and columns are rearranged in descending order according to the weight of each single word.

From Assumption 6, let q = min(m, n); that is, there is no identical concept word in the short text sets Ã and B̃, and only similar words exist. Then the q × q square matrix is obtained as in Eqs. (13) and (14):
Ã = {a'1, ..., a'q},    (13)

B̃ = {b'1, ..., b'q},    (14)
where a'k ∈ Ami and b'k ∈ Bnj. The relevance has nothing to do with the order and meets the inequalities (15) and (16) as constraints:
∀ (m−k) ≤ k ≤ q,    (15)

∀ (n−k) ≤ k ≤ q,    (16)
where the element product of the i-th row and the j-th column is Vij. The similarity between T1 and T2 is calculated as Eq. (17) (Definition 1):

Sim(T1, T2) = ( Σ_{k=1}^{q} ωk × Vij ) / ( Σ_{k=1}^{q} ωk ).    (17)
The Jaccard similarity coefficient Jc(v) is introduced to distinguish the similarity between the concept set V1(v) and the given concept V1(c) and to sort them with Eq. (18) to form the concept dataset V:

Jc(v) = |V1(v) ∩ V1(c)| / |V1(v) ∪ V1(c)| = |V1(v) ∩ V1(c)| / ( |V1(v)| + |V1(c)| − |V1(v) ∩ V1(c)| ).    (18)
The IR concept mapping set is established and sorted according to the similarity of concepts as Eq. (19):

V = {v1, v2, ..., vn}.    (19)
From Definition 3 and Assumption 7, the IR concept dataset V is linked with the CRD dataset H (obtained with Eqs. (1)-(18)) to obtain the new concept dataset G, which meets Eqs. (20)-(21) and the constraint in Eq. (22):
H = {h1, h2, ..., hm},    (20)

G = |V ∩ H| = |V × H|,    (21)

vn < hm,    (22)
where |V ∩ H| denotes the space mapping between the two concept sets and |V × H| denotes the number of associated nodes of the two concept sets. The number of concept nodes in the IR concept dataset V is less than the number of concept nodes in the CRD dataset H. The node similarity (NS) is calculated as Eq. (23):

NS = Σ_{(n,m)∈R} S(vn, hm),    (23)
where m and n are respectively any points in the set of correlated points R and S is the nodal similarity function. After introducing the Jaccard similarity coefficient Jc(v) and combining it with the two-set conceptual space mapping algorithm proposed in this paper, the concept mapping is fitted and the concept set G is sorted according to the fitting degree (Eq. (24)):

Corrlink_c(g) = γ(g) × Jc(g) / NSc(g),    (24)
where γ is the correlation attenuation coefficient of the linked concept, and Corrlink_c(g) represents the correlation between any concept g and a given concept c. Then, with the similarity functions in Eqs. (17) and (23), the similarity between the two concept sets is calculated. With Eqs. (1)-(16), (18)-(22), and (24) as hypothesis conditions, a mathematical model of the expert knowledge recommendation system (EKRS) is constructed.
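A rough illustration of Eqs. (23)-(24) only: in the sketch below, the nodal similarity function s, the attenuation value gamma_g, and the helper names are assumptions made for the example, not the paper's implementation.

def node_similarity(matched_pairs, s):
    # Eq. (23): NS = sum of S(vn, hm) over the correlated node pairs R.
    return sum(s(v, h) for v, h in matched_pairs)

def corrlink(jc_g, ns_g, gamma_g=0.8):
    # Eq. (24): Corrlink_c(g) = γ(g) × Jc(g) / NS_c(g); gamma_g is illustrative.
    return gamma_g * jc_g / ns_g if ns_g else 0.0

# Toy nodal similarity: exact matches score 1.0, partial word overlaps 0.5.
def s(v, h):
    return 1.0 if v == h else 0.5 if set(v.split()) & set(h.split()) else 0.0

pairs = [("semantic similarity", "semantic similarity"),
         ("knowledge representation", "knowledge graph")]
ns = node_similarity(pairs, s)
print(corrlink(jc_g=0.4, ns_g=ns))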
4. Algorithm design

In order to integrate the concept set of a special field into a whole concept set with wider coverage and higher quality and to continuously mine and discover knowledge, this paper constructs a two-set concept space mapping algorithm based on the above mathematical model. The design process of the algorithm mainly involves the selection and preprocessing of datasets, the connectivity of concept nodes, the calculation of the shortest paths among nodes, and the implementation steps of the algorithm.

4.1. Preprocessing

4.1.1. Dataset

The IR dataset adopted in the study is a 51.4 GB XML file obtained by decompressing the IR of our university (Chengfu, Nie & Haiyuan, 2014). This file contains all academic papers of our scholars in the past 10 years. The CRD dataset adopted in the study includes full-text databases such as WOS and Elsevier (in English) and CNKI (in Chinese). In order to realize the link mapping between the two datasets, the 'HowNet' knowledge system is selected as the thesaurus (Gao Lei, 2015). The 2009 edition of 'HowNet' contains 97,606 Chinese words, 94,386 English words, 112,465 Chinese sememes, 128,680 English sememes, 29,227 concepts and 189,136 records. In Chinese processing, 'HowNet'-based semantic similarity calculation is currently the most widely used and mature approach. The 'HowNet' knowledge base has different applications in various fields. For example, the classification system of 'HowNet' can be applied to classification, and the concepts in the professional fields of the CNKI knowledge base can be used for information mining, text clustering and knowledge extraction in professional fields.

4.1.2. Universal connectivity of concept nodes

In path-based semantic analysis, finding a path from one term to another through hyperlinks is based on the assumption that such a path exists. Although experience suggests that terms in basic network datasets should be generally interconnected, there is a lack of sufficient verification. On the basis of the pre-processing, we attempt to verify whether there is a universal connection between any two nodes in the IR dataset and the CRD set. According to the breadth-first search formula (Eq. (25)), connected components are generated in the undirected graph composed of the nodes, and then the distribution of the nodes over the connected components is obtained. The calculation results show that 99.91% of the nodes are included in the maximum connected component. This means that, taking the whole mapping network as an undirected graph, there is a path between almost any two nodes.
Nbldc(v) = [log(max(|V1(c)|, |V1(v)|)) − log(|V1(c) ∩ V1(v)|)] / [log(|ω|) − log(min(|V1(c)|, |V1(v)|))],    (25)

where V1(v) denotes any concept in the concept set and V1(c) denotes the given concept.
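A minimal pure-Python sketch of the connectivity check described above (the toy adjacency data and the helper name are assumptions; the real computation runs over the full IR/CRD node set):

from collections import deque

def largest_component_fraction(adjacency):
    # Breadth-first search over an undirected concept graph (node -> set of
    # neighbours); returns the share of nodes in the largest connected component.
    seen, best = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adjacency[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best / len(adjacency)

# Toy graph standing in for the concept-link network.
toy = {
    "similarity": {"semantic similarity", "clustering"},
    "semantic similarity": {"similarity"},
    "clustering": {"similarity"},
    "isolated term": set(),
}
print(largest_component_fraction(toy))  # 0.75 for this toy graph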
4.1.3. Shortest path among concept nodes

In order to make full use of the paths between concepts and to reduce the search time and the complexity of the algorithm, we define the shortest path length between two concepts as Eq. (26):
L = (f + b) / 2,    (26)
where L denotes the length of the shortest path, f denotes the forward length of the shortest path, and b denotes the backward length of the shortest path.

In the evaluation of semantic similarity through the two-way shortest path, a key problem is setting the upper limit of the shortest-path step count. Although we have verified the universal connection of nodes, we still have to deal with overly long paths when searching for the shortest path between two nodes. To solve this problem, we designed the following experiment. Firstly, 1000 pairs of words were randomly selected. Then, the shortest forward path length of each pair of words was obtained with the breadth-first search algorithm. Finally, the average shortest path length between two nodes, E(L) = 3.84, and the standard deviation, σ = 1.73, were obtained. According to the Chebyshev inequality (Eq. (27)),
P{ |L − E(L)| ≥ k × σ } ≤ 1 / k².    (27)
Let k = 3; then E(L) + kσ = 9.03, so the upper limit of the step size of the shortest path was set to 8, i.e., L ≤ 8. If the path length exceeds 8 steps, we consider that there is no similarity between the two words.
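A minimal sketch of how the capped two-way search of Eq. (26) might be implemented (the adjacency structure, the helper names, and the cap constant are assumptions for illustration, not the paper's code). In practice f and b would be computed over the directed hyperlink graph built during preprocessing.

from collections import deque

MAX_STEPS = 8  # upper limit of the step size derived above

def bfs_length(adjacency, start, goal, max_steps=MAX_STEPS):
    # Breadth-first search for the forward (or backward) path length;
    # returns None once the search depth exceeds max_steps.
    if start == goal:
        return 0
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_steps:
            return None
        for nb in adjacency.get(node, ()):
            if nb == goal:
                return depth + 1
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return None

def two_way_length(adjacency, u, v):
    # Eq. (26): L = (f + b) / 2 with f, b the forward and backward lengths;
    # None means no similarity is assumed between the two words.
    f, b = bfs_length(adjacency, u, v), bfs_length(adjacency, v, u)
    return (f + b) / 2 if f is not None and b is not None else None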
4.2. Algorithm construction and implementation steps

The flow chart of the algorithm is shown in Fig. 2.

Fig. 2. Flow chart of the algorithm.

The detailed algorithm steps are described as follows.

4.2.1. Concept similarity calculation of the IR dataset

Step 1: Experts' academic papers are extracted from IR and each field of the metadata is analyzed. Concept words ωmi and ωnj are extracted from T1 and T2 as the initial points of calculation, as in Eqs. (2) and (3).
Step 2: Let i, j = 0 to clear the vector space.
Step 3: Let i = i + 1, j = j + 1; L ≤ 8 (see Eqs. (26) and (27)) indicates that there are concept words and that the vector distance between the two texts exists.
Step 4: ami = bnj (see Eqs. (8)-(10)): the extracted knowledge concept is matched with the thesaurus of the 'HowNet' knowledge system. If the matching is successful, go to Step 6; otherwise, go to the next step.
Step 5: Let L = L − 1, generate the q × q matrix, and obtain the vector space matrix according to Eqs. (13) and (14).
Step 6: L = 0 denotes that the spatial distance between the two concepts is 0. After similarity matching, the dataset V is constructed (see Eqs. (17) and (18)).

4.2.2. Concept similarity calculation of the space mapping set between CRD and IR

Link matching is performed between the IR dataset V and the CRD dataset H.
Table 3
Score comparison based on concept similarity assessment of the three shortest paths (training set : test set).

Path   1:1    3:7    7:3
f      0.51   0.46   0.49
b      0.53   0.49   0.51
L      0.84   0.77   0.80
Step 7: If vn is not matched in H, a new concept word is constructed to fill in the 'HowNet' knowledge system as a thesaurus entry (the content of a follow-up study).
Step 8-Step 13: Form the dataset H according to Step 1-Step 6.
Step 14: Return to Step 7.
Step 15: After the algorithm ends, the entire extraction process of similar texts is completed and a new network mapping concept set G (EKRS) is obtained (see Eqs. (23) and (24)).

Examples: We randomly selected two texts of an expert's work in recent years as examples.

Step 1:
T1: 'Calculate semantic similarity based on large-scale knowledge repository'
T2: 'A knowledge graphical method based on link and semantic association'
ωmi = {large-scale knowledge repository; semantic similarity; Wikipedia; shortest path; connectivity}
ωnj = {knowledge schematization; concept topology; word embedding; knowledge representation; Wikipedia}

Steps 2-6: The IR concept set V is established and sorted according to the similarity of concepts:
V = (knowledge representation, large-scale knowledge, semantic similarity, concept topology, word embedding, ...)
Steps 7-14: The CRD concept mapping set H is established and sorted according to the similarity of concepts, with each vn matched against H:
H = (knowledge representation, concept similarity, knowledge graph, semantic similarity, concept clustering, semantic user model, ...)
Step 15: A new network mapping concept set G (EKRS) is obtained, G = |V ∩ H|, and the corresponding texts are recommended to the expert: {A review of large-scale network association; Mining categorical sequences from data with a hybrid clustering method; Sentiment analysis and user similarity for social recommender system; Word semantic similarity measurement based on Naïve Bayes model; ...}
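A toy sketch of Steps 7-15 (hypothetical data only): concepts from the ordered IR set V are matched against the CRD set H; matched concepts form the mapping set G, while unmatched ones become candidate new thesaurus entries as in Step 7.

V = ["knowledge representation", "large-scale knowledge", "semantic similarity",
     "concept topology", "word embedding"]
H = ["knowledge representation", "concept similarity", "knowledge graph",
     "semantic similarity", "concept clustering", "semantic user model"]

G, new_thesaurus_entries = [], []
for v in V:  # V is already ordered by concept similarity
    (G if v in H else new_thesaurus_entries).append(v)

print(G)                      # concepts whose linked texts are recommended to the expert
print(new_thesaurus_entries)  # Step 7: candidates to be added to the 'HowNet' thesaurus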
Fig. 3. Clustering diagram of conceptual similarity for three data sets.
5. Experiments and algorithmic performance analysis

5.1. Experimental platforms

5.1.1. Concept similarity calculation of IR and CRD

In the two groups of concept similarity calculation experiments, the same hardware and software platform was adopted. The hardware configuration of the experimental platform is as follows: CPU, Intel® Core™ i5-8250U 1.80 GHz; memory, 8 GB; operating system, Windows 10. All programs of the algorithm are written in Python. Based on Python 3.6, the main Python modules used include: dependency parser, SpaCy 2.0.0; mathematical computing library, Numpy 1.21.1; word embedding vector, the 'HowNet' knowledge system.

5.1.2. IR experimental platform

The hardware configuration of the IR experimental platform is as follows: CPU, E5 2620; RAM, DDR3 32 GB; disk, 5 TB; main frequency, 2.6 GHz; operating system, Windows Server 2012; software environment, Tomcat 8.0, Apache 2.4.6, JDK 1.8.0_91, .NET Framework 4, SQL Server 2016.

5.1.3. CRD experimental platform

The hardware configuration of the CRD experimental platform is as follows: CPU, Xeon E5-2603 v3; frequency, 3.2 GHz; RAM, DDR4 128 GB; disk size, SATA 19.2 TB; operating system, Windows Server 2012.

5.2. Experimental process and analysis

5.2.1. Improvement of the standard dataset

After decades of development, the 'HowNet' knowledge system dataset has increased tens of times in magnitude, and some words cannot be mapped to the 'HowNet' dataset for path exploration. Therefore, we improved the 'HowNet' knowledge system dataset by replacing redirected words and deleting words that were not found in the 'HowNet' dataset. On the basis of the training experiment, the average shortest path L in Eq. (26) was changed to 3.5. The test set contains a series of pairs of single sentences and the corresponding artificial interpretation results. The results of artificial interpretation are between 0 and 5 points. An interpretation of 5 points indicates that the semantics of the two sentences are identical; an interpretation of 0 points indicates that the semantics
Table 4
Sorting and documentation results of similar words in each dataset (similar word, number of documents).

IR: Similarity (42), Similarity degree (32), Calculation method (21), Semantic similarity (18), wordNet (18), Web (15), ...
CNKI: similarity (408), similarity degree (378), homophyly (374), Thinking mode (247), ontology research (150), Calculation method (144), Algorithm research (121), conceptual metaphor (100)
Mapping concept dataset in this study (G): similarity (399), conceptual similarity (320), Data mining (284), clustering algorithm (149), Algorithm research (137), Recommendation algorithm (116), Cognitive linguistic (88), ...
Fig. 4. Similarity structure diagram of CNKI concept set.
of the two sentences are completely different; interpretations from 5 down to 0 points indicate that the semantic similarity gradually decreases. The evaluation index is the Pearson correlation coefficient between the algorithm output on the test set and the artificial interpretation results (Eq. (28)), with which the similarity evaluation accuracy of the shortest path algorithm is verified:

ρ_{X,Y} = Σ_{i=1}^{N} (Xi − X̄)(Yi − Ȳ) / sqrt( Σ_{i=1}^{N} (Xi − X̄)² × Σ_{i=1}^{N} (Yi − Ȳ)² ),    (28)

where Xi denotes the artificial judgment result of Texti and Yi denotes the algorithmic result of Texti.
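For reference, a short sketch of the evaluation step of Eq. (28) with toy scores (the arrays are hypothetical; the real X and Y come from the test set and the algorithm output):

import numpy as np

X = np.array([4.5, 3.0, 1.5, 0.5, 5.0])  # artificial interpretation scores (0-5)
Y = np.array([4.1, 2.8, 1.9, 0.7, 4.6])  # corresponding algorithm outputs

rho = np.corrcoef(X, Y)[0, 1]  # Eq. (28): Pearson correlation coefficient
print(round(rho, 3))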
5.2.2. Experimental process and results

5.2.2.1. Similarity evaluation accuracy of the shortest path algorithm. Three groups of experiments were designed. According to Eq. (26), f (the forward shortest path), b (the reverse shortest path), and L (the bidirectional shortest path) were combined with the artificial scores for prediction. The proportions of the training sets and test sets were set as 1:1, 3:7, and 7:3, respectively, and the average was obtained after repeating 300 times. The experimental results are shown in Table 3. When f and b were used separately, under the three proportions the maximum similarity reached 0.51. When L was used, the maximum similarity reached 0.84, indicating the higher performance of the algorithm in concept similarity calculation.

5.2.2.2. Clustering effect of concept similarity in the three datasets. In order to evaluate the performance of text similarity intuitively, according to Eqs. (17) and (23), the document distances of all documents were calculated in each of the three datasets. Here, we chose the CNKI full-text database as the CRD and the mapping concept dataset
Fig. 5. Similarity structure diagram of EKRS (G).
G in this study. Then, with the t-SNE algorithm (t-distributed Stochastic Neighbor Embedding) (Maaten & Hinton, 2008), the distribution of documents in two-dimensional space was obtained for the three datasets. The performance of the distance measurement can be judged by the aggregation degree of the sets of points of the same category in the t-SNE visualization results. Based on the concept similarity, the concept sets with semantic similarity were selected to test the algorithm in this paper. Table 4 shows the sorting and documentation results of similar words in each dataset. Fig. 3 illustrates the similarity of concepts in terms of the path distance between similar word documents. As shown in Fig. 3(a), the IR concept set, based on the professional field, has fewer than about 200 similar concept nodes; the concept distribution is sparse and the degree of aggregation is good. As shown in Fig. 3(b), the CRD concept set is based on the platform of CNKI (China National Knowledge Infrastructure). CNKI has about 1000 similar concept nodes with the best coverage of concept similarity, but its similarity sorting ability is weaker than that of the other two datasets. As shown in Fig. 3(c), the two-set concept space mapping set, EKRS (G), proposed in this paper has about 500 similar concept nodes with more precise similarity and shows a better degree of similarity aggregation, while its computational complexity is much lower than that of the CNKI concept dataset, indicating the higher performance of the algorithm in this paper. The experimental results are consistent with the ranking results of concept similarity of the various databases in Table 4. The similarity calculation times were 1.02 s, 1.41 s, and 1.13 s, respectively, and the test data used in this paper were small samples. The main advantage lies in the clustering accuracy; the advantage in computing power is not significant.
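A minimal sketch of the t-SNE projection step used for Fig. 3 (the random document vectors and parameter values are placeholders, not the paper's data):

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(200, 50))  # placeholder document vectors

# Project to 2-D; tighter clusters of same-category points indicate a better
# distance measure, which is how Fig. 3 is read.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(doc_vectors)
print(embedding.shape)  # (200, 2)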
5.2.2.3. Graphical analysis of the similar concept set structures of CNKI and G. The IR dataset has the simplest conceptual structure in the domain of expertise, so a comparison with it is not required. From Table 4, 33 words similar to "concept similarity" are extracted from CNKI, and the similarity order of these words is similarity, similarity degree, homophyly, thinking mode, ontology mapping, calculation method, etc. Twenty-seven words similar to "concept similarity" are extracted from the mapping concept set G, and their similarity ranking is similarity, concept correlation, data mining, clustering algorithm, algorithm research, recommendation algorithm, etc. The experimental results showed that, in terms of the concepts extracted from the two concept sets, the mapping concept set (G)
was more consistent with the knowledge structure of experts. As shown in the structure diagrams of Figs. 4 and 5, since the CNKI concept set is the core resource set with many similar conceptual nodes, a better recall ratio can be achieved by using concept links, but the structure of the concept links is more complicated. In contrast, the mapping concept set of EKRS (G) has been mapped and filtered twice. Therefore, the similarity accuracy is higher, the recommended knowledge structure is simple, and the computational complexity is reduced.

Fig. 6. Comparison of the recall ratio and precision ratio of the test data for different numbers of recommended texts.

Table 5
Experimental results of datasets with different sparsity degrees.

Experiment No | Sparsity of text to be recommended | Number of texts to be recommended | Recall ratio | Precision ratio
1 | 0.05% | 500 | 41.64% | 42.20%
1 | 2.30% | 500 | 43.04% | 6.12%
1 | 5.20% | 500 | 44.92% | 0.98%
2 | 0.05% | 1000 | 42.99% | 44.63%
2 | 2.30% | 1000 | 46.22% | 49.22%
2 | 5.20% | 1000 | 48.19% | 51.87%
3 | 0.05% | 2000 | 44.26% | 46.42%
3 | 2.30% | 2000 | 47.56% | 50.72%
3 | 5.20% | 2000 | 50.98% | 53.02%

5.2.2.4. Performance assessment of EKRS. Accuracy is the parameter used to measure knowledge recommendation systems, and it can be measured by the recall ratio and precision ratio of the recommendation results. Accuracy has been demonstrated in the first three experiments. The IR dataset belongs to an academic knowledge base in professional fields, and the sparsity degrees of its datasets differ, so text datasets of different professional types were selected for the experiment. In this experiment, datasets with different sparsity degrees of 0.05%, 2.3% and 5.1% were selected to test the stability of the recommendation system optimized by the text processing algorithm. The experimental results are shown in Table 5. Fig. 6 shows the recall ratio and precision ratio of the expert knowledge recommendation system (mapping concept set G) for the test data with different sparsity degrees.

Table 5 shows the test datasets with different sparsity degrees (0.05%, 2.3% and 5.1%). After the test datasets were applied to the
text recommendation system based on the expert knowledge recommendation system (EKRS), as the number of texts increased from 500 and 1000 to 2000, the recall ratio and precision ratio of each text dataset increased slightly. The accuracy was slightly improved without significant fluctuation. This shows that the EKRS constructed in this paper has good stability in dealing with concept sets with different sparsity degrees and can effectively alleviate the instability of knowledge recommendation systems caused by sparsity in small-scale datasets.

6. Conclusion and future study

Based on conceptual similarity and space mapping, we constructed a new knowledge mapping set between the CRD dataset and the IR dataset. In a sense, the space mapping reflects the changing mode of knowledge transfer, from a small and complete self-supply mode to a central and modular mode. The graphical analysis of the shortest paths between concepts represents the knowledge structure. Although it is an idealized idea to some extent, the concept link largely represents the knowledge structure of scholars and provides an important reference for establishing knowledge association data and the knowledge ontology of one's own organization and for making better use of mathematical models and related algorithms. In the future, based on the IR platform, we will simulate the process of scholars' academic behaviors, improve and update the IR ontology database, and expand the knowledge value-added service functions of the EKRS platform.

Conflict of interest statement

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Credit authorship contribution statement

Li Gao: Conceptualization, Methodology, Validation, Formal analysis, Writing - original draft, Software, Writing - review & editing, Supervision, Funding acquisition. Kun Dai: Data curation, Writing - original draft, Software, Writing - review & editing, Supervision. Liping Gao: Visualization, Investigation, Supervision, Funding acquisition. Tao Jin: Writing - review & editing, Supervision.

Acknowledgements

The project was supported by the Humanities and Social Science Foundation of the University of Shanghai for Science & Technology (No. 1F-18-201-001) and the Natural Science Foundation of Shanghai (17ZR1429100).

References

Al-Hassan, M., Lu, H., & Lu, J. (2015). A semantic enhanced hybrid recommendation approach: A case study of e-government tourism service recommendation system. Decision Support Systems, 72, 97–109.
Calvo, H., Méndez, O., & Moreno-Armendáriz, M. A. (2016). Integrated concept blending with vector space models. Computer Speech and Language, 40(17), 79–96.
Cao, X., Zhang, W., & Yong, Y. (2018). A review of large-scale network association. Journal of Shanghai Jiaotong University, 52(10), 1348–1356.
De Angelis, L., & Dias, J. G. (2014). Mining categorical sequences from data using a hybrid clustering method. European Journal of Operational Research, 234(3), 720–730.
Dias, J. G., & Ramos, S. B. (2014). Dynamic clustering of energy markets: An extended hidden Markov approach. Expert Systems with Applications, 41(17), 7722–7729.
Diel, R., & Lerasle, M. (2017). Non parametric estimation for random walks in random environment. Stochastic Processes and their Applications, 128(1), S0304414917301254.
Elliott, R. J., Siu, T. K., & Fung, E. S. (2014). A double HMM approach to Altman z-scores and credit ratings. Expert Systems with Applications, 41(4), 1553–1560.
Fiorelli, M., Pazienza, M.-T., Stellato, A., & Turbati, A. (2014). CODA: Computer-aided ontology development architecture. IBM Journal of Research and Development, 58(2/3), 14:1–14:12.
France, S. L., Carroll, J. D., & Xiong, H. (2012). Distance metrics for high dimensional nearest neighborhood recovery: Compression and normalization. Information Sciences, 184(1), 92–110.
Jie, Z., Jiayong, C., Ling, L., & Ruifang, H. (2015). Research on key technologies of institutional knowledge base based on the law of document data. Information and Information Work, 30(1), 65–70.
Kaliraj, M. R. S., & Bharathi, A. (2018). Path testing based reliability analysis framework of component based software. Measurement, 144, 20–32. https://doi.org/10.1016/j.measurement.2018.11.086.
Kushwaha, N., & Vyas, O. P. (2014). SemMovieRec: Extraction of semantic features of DBpedia for recommender system.
Gao, L. (2015). Research on knowledge acquisition and ontology base filling based on HowNet. Hebei University of Technology.
Zhang, L., Sun, Y., & Luo, T. (2017). Calculate semantic similarity based on large scale knowledge repository. Computer Research and Development, 54(11), 2576–2585.
Yang, L., Zhang, L., Luo, T., Wan, Q., & Wu, Y. (2017). A knowledge graphical method based on link and semantic association. Computer Research and Development, 54(8), 1655–1664.
Liu, Q., Zhang, R., Hu, R., Wang, G., Wang, Z., et al. (2019). An improved path-based clustering algorithm. Knowledge-Based Systems, 163, 69–81.
Ma, S. (2017). Institutional knowledge base: Library service innovation platform. Library Science Research, 31(2), 58–63.
Maaten, L. van der, & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Masuda, N., Porter, M. A., & Lambiotte, R. (2016). Random walks and diffusion on networks. Physics Reports, 716–717.
Pham, T. N., Vuong, T. H., Thai, T. H., Tran, M. V., & Ha, Q. T. (2016). Sentiment analysis and user similarity for social recommender system: An experimental study. In Information Science and Applications (ICISA). Singapore: Springer.
Tan, S., Guan, Z., & Cai, Z. (2014). In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (pp. 159–165). Quebec, Canada: AAAI.
Wang, H. H., Gibbins, N., Payne, T., & Patelli, A. (2013). A survey of semantic web services formalisms. In 2013 Ninth International Conference on Semantics, Knowledge and Grids (SKG). IEEE Computer Society.
Wang, J., Yan, Z., & Zuo, W. (2015). Word semantic similarity measurement based on naïve Bayes model. Journal of Computer Research and Development, 52(7), 1499–1509.
Wei, C., Nie, H., & Cui, H. (2014). Construction of institutional repositories developed by multi-libraries collaboration: A case study of CALIS institutional repository project. Journal of University Library, 32(3), 69–73.
Xiao-Wen, H., Ming, Y., Ji-Tao, S., & Chang-Sheng, X. U. (2016). Association rules mining based cross-network knowledge association and collaborative applications. Computer Science, 43(7), 51–56.
Yanes, N., Ben Sassi, S., & Hajjami Ben Ghézala, H. (2015). A multidimensional semantic user model for COTS components search personalization. In Innovation Management and Sustainable Economic Competitive Advantage, the 26th IBIMA Conference (pp. 2286–2295).