Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 139 (2018) 56–63

www.elsevier.com/locate/procedia
The International Academy of Information Technology and Quantitative Management, the Peter Kiewit Institute, University of Nebraska
FIRDoR - Fuzzy information retrieval for document recommendation

Rodrigo Costa dos Santos a,b,*, Maria Augusta Soares Machado c

a UFF – Universidade Federal Fluminense – IC – Instituto de Computação, São Domingos, 24210-310, Niterói-RJ, Brazil
b Instituto INFNET, R. São José, 90 - Centro, 20010-020, Rio de Janeiro-RJ, Brazil
c IBMEC, Av. Presidente Wilson, 118, 20030-020, Rio de Janeiro-RJ, Brazil
Abstract

This paper presents FIRDoR, a recommendation methodology for document retrieval using a Fuzzy Inference System, with the goal of converging the area of interest of the user with that of the documents recovered. With the increasing number of documents available today on the internet and in databases, the task of assisting users in finding relevant information becomes very complex using conventional methods, due to the difficulty in retrieving and ranking results. This study performs an experiment in order to identify the needs of the user, by their areas of interest, and the fuzzy classification approach in terms of a semantic document in order to make a recommendation. For the experiment, a base of 918 papers and 8 researchers were used, and the results were gratifying, with a high degree of certainty in the recommendation of papers of importance for the user.

© 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer review under responsibility of the scientific committee of The International Academy of Information Technology and Quantitative Management, the Peter Kiewit Institute, University of Nebraska.
Keywords: fuzzy systems; information retrieval; semantic recommendation.
1. Introduction

A central concept in Information Retrieval (IR) is the relevance of the result to the user. The process of
* Corresponding author. Tel.: +55 21 98141-1594
E-mail address: rodrigocsantoss@gmail.com; rodrigo.santos@prof.infnet.edu.br

1877-0509 © 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer review under responsibility of the scientific committee of The International Academy of Information Technology and Quantitative Management, the Peter Kiewit Institute, University of Nebraska.
10.1016/j.procs.2018.10.217
information retrieval is characterized by its level of uncertainty in the assessment of the usefulness of the documents returned by a search. Consequently, it is more realistic to think of the probability of relevance than of precise relevance. Relevance therefore depends on the user's expectation of what is to be retrieved and on the interaction with the system, and is inherent in the judgment of the user. The classic Boolean model assumes that a document either matches the query or it does not; that is, there can be only two assessments: true or false. More refined models generate an ordering of the retrieved documents according to certain criteria, one of which is probability: documents are sorted according to the probability of relevance to a query [1]. Most of these sorting systems operate at the statistical and linguistic level, primarily by analyzing word frequency or by recognizing forms of the same word (stemming). This renders the process incomplete, since it is unable to capture and make use of specific domain or context characteristics [2]. Specialists often come across the need to retrieve documents whose content is related to a specific field but, in most cases, fail to make such a selection with existing search engines, because the search engines overlook the area of interest of the user, causing frustration and wasted time. To achieve this objective, fuzzy logic [3] has proven to be an excellent tool, since it processes uncertainties; the uncertainty in this case is the importance of specific data to the user. This paper proposes the use of a Fuzzy Inference System, in conjunction with an objective clustering technique, aiming at achieving a unified result. An experiment was carried out with professors and specialists in which their academic curricula and scientific work (specifically, papers in the area of computation) were analyzed.

2. Literature Review

This section presents the main concepts on which the proposal of this work is based. Subsection 2.1 is a brief overview of the problem of retrieval of semantic information and demonstrates how it relates to this study. Subsection 2.2 is a review of the main concepts of Fuzzy Logic, which will then be incorporated into the methodology. Subsection 2.3 is a general overview of a Lattes curriculum [4].

2.1. Retrieval of Semantic Information

With today's explosion of data, information and knowledge in databases that are increasingly large and complex, users encounter new challenges in information retrieval. The retrieval of textual information was traditionally based on the BOW (bag of words) approach, in which documents are treated solely as collections of words associated with their frequency of occurrence. A disadvantage of the Boolean and vector retrieval models is that factors such as how the words of a document relate to each other, how the documents of a collection relate to each other, or how the conveyed information relates to information already available to the user play no role. In such models, the documents are completely decontextualized and, except for very simple procedures such as removing stop words and stemming, the words are regarded merely as character strings, stripped of semantics. Various alternative approaches have been proposed to better explore the richness and complexity of the information bases on which retrieval is carried out, consequently improving the results obtained. Under the umbrella term "Semantic Information Retrieval" fall expansion mechanisms based on thesauri, semantic annotation of unstructured text, and the use of ontologies, among others.
Other approaches being studied in order to add semantics to retrieval use the classification of documents into groups or clusters and take into account the relationship between the texts in the collection of documents. In this paradigm, the documents are not considered in decontextualized form: they are situated within a context and organized into clusters that can be classified and labeled from the lexical analysis of their content. All the same, this technique does not meet the needs of the user. This paper proposes to classify the results of a search into groups of relevant documents and pertinent recommendations, personalized for each user, in order to improve retrieval performance.
2.2. Fuzzy Logic
Fuzzy logic, or fuzzy math, was conceived by Lotfi A. Zadeh, an electrical engineer and professor emeritus of computer science at the University of California, Berkeley, and published in a paper entitled "Fuzzy Sets" in 1965. Fuzzy sets and fuzzy logic provide the basis for generating powerful problem-solving techniques with wide applicability, especially in the areas of control engineering and decision-making [5]. The strength of fuzzy logic derives from its ability to infer conclusions and generate responses based on vague, ambiguous, qualitatively incomplete and imprecise information. In this respect, fuzzy systems have the ability to reason similarly to humans. Their behavior is represented in a very simple and natural way, leading to the construction of understandable and easy-to-maintain systems [6]. In general, we can say that all people have, in some way, had contact with conventional Boolean logic. With this type of logic, a certain statement is either true or false and there is nothing in between - binary logic. However, at times, statements involving only true or false do not make sense. In real-life situations, it is difficult to rigidly include an element in only one group. For this reason, a fuzzy categorization specifies the degree of inclusion of an element of the data sample in a given group. Consider, for example, the statement "Mark is tall because his height is 190 cm." Is Mark being tall a totally true or totally false statement? Certainly not, because the statement depends on context. If one takes into account the average height of a Brazilian, "Mark is tall" is possibly a true statement. However, in another context, for example among basketball players, perhaps this statement is false. According to Zadeh [3], a fuzzy subset of a set X is defined as any function u: X → [0,1]; for each x ∈ X, the value u(x) is the degree of membership of x in the subset U.
Thus, if instead of assuming discrete values in {0,1} the membership function assumes continuous values in [0,1], the set A is called a fuzzy set, in which each individual may belong partially to multiple sets. The value u(x) is used to represent the degree or extent to which x is associated with the semantic description of u; u(x) is not to be interpreted as the probability that x belongs to the class u, but as the degree to which it belongs.
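The "Mark is tall" discussion can be made concrete with a small membership function. The sketch below is illustrative only: the 160 cm and 200 cm cut-offs are assumptions chosen for the example, not values from the paper.

```python
def degree_tall(height_cm):
    """Membership u(x) in the fuzzy set 'tall' over heights in cm:
    0 below 160, 1 above 200, linear in between (illustrative cut-offs)."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 200:
        return 1.0
    return (height_cm - 160) / 40.0
```

In this reading, Mark at 190 cm belongs to "tall" to degree 0.75; the degree expresses how well the semantic description fits him, not the probability that he is tall.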
Fig. 1. Structure of a fuzzy system [7].
According to [7], "nebulous models are usually called fuzzy systems and their general structure is composed of three main modules: fuzzification, inference and defuzzification" (see Figure 1). The functions of the modules of a nebulous system can be summarized as follows. Fuzzification: transformation of quantitative information into qualitative information - a generalization process. Inference: transformation of qualitative information into qualitative information - a conversion process. Defuzzification: transformation of qualitative information into quantitative information - a specification process.

2.3. Lattes Curriculum Vitae
The Lattes curriculum is a standard curriculum for Brazilian professors and researchers, established by CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), the National Council for Scientific and Technological Development. With this standard, the researcher can register, in a structured way, professional and
personal data, academic record, participation in conferences and juries, and other related activities. This information is available to anyone with access to the Internet [4]. For the researcher, the use of the Lattes curriculum brings the advantage of disseminating his work, providing access to research groups and participation in scientific projects. For institutions and universities, the Lattes curriculum makes it easier to find researchers in a given field of interest in order to form research groups, to evaluate a researcher's work and to compare it to the work of others in the same field.

3. Methodology

The methodology aims to facilitate the future construction of a tool for the automatic recommendation of scientific papers to researchers, based on their areas of interest as obtained from their Lattes curricula. With this objective, the methodology was developed in five stages, to be performed successively or independently, according to the interest of a future implementation.

The first stage corresponds to the analysis of the researcher's Lattes curriculum. An algorithm developed in Java captures information from the researcher's curriculum. The information parameters are constructed in such a way that they can be substituted until the ones most suitable to the purpose of the work are obtained. The information chosen was: conference publications (subject of conference, title of paper, areas of expertise and keywords); published papers (name of journal, details of paper, title of paper, keywords, areas of knowledge); and participation on juries (title, keywords and areas of expertise). The curriculum information was collected in Portuguese and English. Also during this first stage, after the capture, the words present in the curriculum were subjected to formatting: stemming for Portuguese and English, removal of stop words in Portuguese and English, TF-IDF calculation and, finally, storage in a MySQL database.
The purpose of this step is the construction of the researcher's profile for recommendation purposes. One can update the information of the chosen researcher as well as add new profiles at any time, rendering the tool dynamic.

In the second stage, a corpus of scientific papers was constructed. For a period of ten years, from 2007 to 2016, a selection was made of papers from conferences and journals related to Information Retrieval and Data Science and available in electronic format. Three sources of papers were selected: a Springer International Publishing journal and two conference series from the ACM (Association for Computing Machinery) digital portal [10], as follows: Information Retrieval Journal (Springer), ISSN 1386-4564, a total of 274 papers; Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ISSN 0163-5840, a total of 326 papers; and Proceedings of the ACM SIGMOD International Conference on Management of Data, ISSN 0163-5808, a total of 387 papers. This gives a total of 987 papers, of which 918 (93%) could be utilized for the experiment. The discarded papers presented quality problems: for example, they did not have keywords or an abstract, the PDF file did not present textual elements or could not be read, or the paper was incomplete, among other issues.

The third stage relates to the construction of a vector profile for each paper to be analyzed. This vector is constructed using formatting similar to that employed for the researcher's profile: stemming for Portuguese and English, removal of stop words in Portuguese and English, and TF-IDF calculation, followed by the storage of the resulting words and their corresponding TF-IDF values in a MySQL database.
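The profile construction above is implemented in Java in the paper, with full stemming and stop-word lists for Portuguese and English. As a minimal sketch of the TF-IDF step alone, the Python fragment below may help; the toy stop list and the +1 smoothing in the idf denominator are assumptions for illustration, and stemming and MySQL storage are omitted.

```python
import math
from collections import Counter

# Toy stop list; the paper uses full Portuguese and English stop-word lists.
STOP_WORDS = {"the", "of", "a", "and", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace and drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_profile(doc, corpus):
    """TF-IDF vector for `doc` against `corpus`, a list of token lists."""
    tokens = tokenize(doc)
    tf = Counter(tokens)
    n_docs = len(corpus)
    profile = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)      # document frequency
        idf = math.log(n_docs / (1 + df))             # +1 avoids division by zero
        profile[word] = (count / len(tokens)) * idf
    return profile
```

In the paper, the resulting word/TF-IDF pairs are archived in a MySQL database; here they are simply returned as a dictionary.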
The WVTool (Word Vector Tool), developed by Michael Wurst (University of Dortmund), was used [8]. As this tool was developed in Java and is made up of several classes that can be invoked independently, it was possible to use it within the proposed program in the development of an integrated algorithm that performs all the tasks required for the archiving of words after their preparation.

The fourth stage aims to create FIS (Fuzzy Inference System) algorithms, with the mathematical calculations executed in MatLab [9], for each researcher and paper. It was observed that the scale could be a topic of further study, since the maximum values of TF-IDF varied
from paper to paper and from one researcher to another. Hence, in this study, a more flexible scale was used for papers and curricula, together with a trapezoidal function for the highest values ("very high") of the scale. The FIS algorithm developed is composed of two inputs and one output, shown in Figure 2: Input 1 (Paper) represents the TF-IDF value of the searched paper; Input 2 (CV) represents the TF-IDF value of the researcher's Lattes curriculum; and Output 1 (Recommendation) represents the paper recommendation for the researcher. This algorithm is executed for each of the words present in the paper that are also present in the researcher's curriculum.
Fig. 2. FIS generic recommendation for papers in Matlab.
It can be observed from the specification that this algorithm contains a set of 25 rules that determine the recommendation, as shown in Table 1.

Table 1. FIS Rules

Rule  TF-IDF-PAPER  TF-IDF-CV  Recommendation
1     Very High     Very High  Highly Recommended
2     Very High     High       Highly Recommended
3     Very High     Medium     Recommended
4     Very High     Low        Slightly Recommended
5     Very High     Very Low   Not Recommended
6     High          Very High  Highly Recommended
7     High          High       Highly Recommended
8     High          Medium     Recommended
9     High          Low        Slightly Recommended
10    High          Very Low   Not Recommended
11    Medium        Very High  Recommended
12    Medium        High       Recommended
13    Medium        Medium     Recommended
14    Medium        Low        Slightly Recommended
15    Medium        Very Low   Not Recommended
16    Low           Very High  Slightly Recommended
17    Low           High       Slightly Recommended
18    Low           Medium     Slightly Recommended
19    Low           Low        Not Recommended
20    Low           Very Low   Not Recommended
21    Very Low      Very High  Not Recommended
22    Very Low      High       Not Recommended
23    Very Low      Medium     Not Recommended
24    Very Low      Low        Not Recommended
25    Very Low      Very Low   Not Recommended
The execution of this algorithm, word for word, raised an additional question: the calculation of the final recommendation result for the paper. Two questions arose: how should this result be analyzed, and what is the best way to gather the results?

In the fifth and final stage of this work, the goal was to determine the best way to jointly evaluate the results of the individual recommendations. Several forms of grouping were considered: simple average, summation, standard deviation, variance and harmonic mean. For the purpose of standardization, the simple average was computed on the same recommendation scale developed for the individual word framing. The result obtained determines the average framing percentage in each of the relevant functions, from most recommended to least: Extremely Recommended, Very Recommended, Moderately Recommended, Slightly Recommended and Not Recommended. Just as in the fourth stage, a Java algorithm was constructed that generated the files necessary for the automatic execution of the process.

4. Results

In order to test the effectiveness of the proposed methodology, the Lattes curricula of eight professors and researchers from the Information Retrieval and Data Science computing areas of large Brazilian universities were used. For the automatic experiment, and to assure validation of the result, papers authored by the participating researchers were identified in the database. This group of papers became the experimental group, given the expectation that these papers would receive a higher recommendation than the others for their author, due to the high contextual alignment between the paper and the needs of the researcher. Many of the papers in the experimental group appeared in the researcher's own Lattes curriculum. Although recommending a paper written by the researcher himself does not make sense in practice when developing a tool for this purpose, this was the way found to validate the test carried out.
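The fifth-stage grouping described above can be sketched as a simple-average aggregation of per-word results; the 0-4 numeric encoding of the labels is an assumption made for illustration, since the paper averages on the fuzzy recommendation scale itself.

```python
# Per-word recommendation results are encoded on a 0-4 scale (an illustrative
# assumption), averaged, and the mean is mapped back to the nearest label.
SCALE = ["Not Recommended", "Slightly Recommended", "Moderately Recommended",
         "Very Recommended", "Extremely Recommended"]

def aggregate(word_scores):
    """Simple average of per-word scores in [0, 4], rounded to the nearest label."""
    mean = sum(word_scores) / len(word_scores)
    return SCALE[round(mean)]
```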
The result observed was that the papers written by the researchers were indeed recommended to their own authors with high relevance: six out of eight researchers received a relevance of "Extremely Recommended", proving the initial hypothesis of the test. For the two other researchers, the papers were recommended as "Very Recommended", which is an equally satisfactory result.

A point of great importance in the tests was the analysis of the ideal number of words to be evaluated. It should be noted that the more words are selected, the higher the likelihood that the paper will not be recommended. This is because many of the words present in a paper have no specific meaning for its context; that is, they are words used in the description of the paper that are not necessarily part of the researcher's curriculum (e.g. calculation, framing, average, sample, etc.). If a paper is extensive, it will probably contain several words used to explain the subject that are not directly linked to the context analyzed, adding negative weight to the recommendation task. Thus, identifying a number of words large enough for the characterization of the paper and small enough
that the classification was not distorted by explanatory words was of great concern during the development of the study. To minimize this problem, a weighting artifice was used in the TF-IDF calculation: weight 1 was assigned to the text of the paper and weight 3 to the title, abstract and keywords, since they represent its most significant words.

As mentioned in the explanation of the methodology, the second point of special attention in this study was the scales used for the specification of the fuzzy tables. It was observed that in one highly relevant paper on a specific topic the word with the highest TF-IDF reached 0.32; in another, 0.46; and in another, 0.57. Thus, using a scale ranging from 0 to 1 would frame the paper as slightly recommended or not recommended. Two alternatives were considered to work around this issue: the use of variable scales per researcher or the normalization of values. In this study, it was decided to maintain a fixed scale with lower values at the upper limit and, as an additional measure, to use a trapezoidal function instead of a triangular one in order to improve the relevance of the results to the fuzzy number. However, it was understood that the scale is an issue that should be further studied to assess the impact of its use in this type of work.

Finally, a manual test was also carried out, which consisted of selecting 10 papers at random and one of the 8 researchers participating in the study. The researcher read the 10 chosen papers and assigned a degree of interest to each of them. This result was compared with the automatic recommendation generated by the FIS, and the outcome was very satisfactory: for 8 papers (80%) the automatic recommendation converged with the opinion of the expert, and for 2 papers (20%) the recommendation differed by one degree on the scale, as illustrated in Table 2.

Table 2. Recommendation result

Paper  Degree attributed by the Researcher  Degree attributed by the FIS system
1      Highly Recommended                   Highly Recommended
2      Highly Recommended                   Moderately Recommended
3      Not Recommended                      Not Recommended
4      Extremely Recommended                Extremely Recommended
5      Moderately Recommended               Moderately Recommended
6      Not Recommended                      Not Recommended
7      Moderately Recommended               Moderately Recommended
8      Slightly Recommended                 Slightly Recommended
9      Not Recommended                      Slightly Recommended
10     Not Recommended                      Not Recommended
5. Conclusions

This study presents a solution that uses fuzzy logic to recommend documents to the user, more specifically, the automatic recommendation of papers for specialists, taking into account their interests and academic background. According to the topics presented, the objective was achieved: the FIS system provided a recommendation of papers very close to the scientific interests of the researchers, as shown by the results of the tests. This approach may be considered very useful and applicable to other situations where semantic information retrieval is required. Following this line of reasoning, one can imagine the conception of a system that performs an automatic Internet search focused on the interests of the user. Such a system would daily recommend a set of papers to each registered user, without the need to search the internet for "news" in their field of interest.

The following points should be considered with regard to future study. The analysis of other parts of the Lattes curriculum, or of other sources of information such as social networks, to better characterize the researcher's areas of interest would be an interesting line of development.
In this study, among the various forms of ranking individual words, we chose the simple arithmetic average, since it was considered that, due to the number of words selected, the arithmetic mean could be a more accurate representation of the analyzed paper. However, a more detailed analysis of the ideal number of words would be another point to be developed in future studies. The use of independent specifications - e.g., independently standardizing papers and curricula - should also be taken into consideration. Finally, the comparison or combination of the methodology described here with others found in the literature, such as the use of thesauri, semantic annotation of unstructured text or the use of ontologies, could bring interesting developments to the results.
References

[1] Baeza-Yates, R.; Ribeiro-Neto, B. Modern Information Retrieval. New York: Addison-Wesley, 1999.
[2] Ingwersen, P.; Järvelin, K. The Turn: Integration of Information Seeking and Retrieval in Context (The Information Retrieval Series). Secaucus, NJ: Springer-Verlag New York, 2005.
[3] Zadeh, L. A. "Fuzzy Sets". Information and Control, 8: 338-353, 1965.
[4] Plataforma Lattes - Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) - Ministério da Ciência e Tecnologia - Brasil. Available at: http://lattes.cnpq.br/conteudo/aplataforma.htm.
[5] Barreto, J. M. Inteligência Artificial no Limiar do Século XXI. Florianópolis: Duplic, 2001.
[6] Oliveira Jr., H. A. "Lógica Difusa: Aspectos Práticos e Aplicações". Rio de Janeiro: Interciência, 1999.
[7] Bojadziev, G.; Bojadziev, M. Fuzzy Logic for Business, Finance and Management. Singapore: World Scientific, 1997.
[8] Wurst, M. The Word Vector Tool - User Guide. Available at: http://wvtool.sf.net/.
[9] Matlab. Available at: http://www.mathworks.com/.
[10] ACM Digital Portal - Association for Computing Machinery. Available at: http://portal.acm.org/portal.cfm.