From keywords to relational database content: A semantic mapping method

From keywords to relational database content: A semantic mapping method

Information Systems xxx (xxxx) xxx Contents lists available at ScienceDirect Information Systems journal homepage: www.elsevier.com/locate/is From ...

2MB Sizes 0 Downloads 62 Views

Information Systems xxx (xxxx) xxx

Contents lists available at ScienceDirect

Information Systems journal homepage: www.elsevier.com/locate/is

From keywords to relational database content: A semantic mapping method ∗

Mariana Soller Ramada, João Carlos da Silva , Plínio de Sá Leitão-Júnior Universidade Federal de Goiás, Instituto de Informática, Alameda Palmeiras, 74001-970, Goiânia, Brazil

article

info

Article history: Received 12 May 2017 Received in revised form 16 November 2018 Accepted 4 October 2019 Available online xxxx Recommended by Gottfried Vossen Keywords: Keyword search Semantic mapping Context analysis Relational databases

a b s t r a c t Keyword-based query specification to extract data from structured databases has attracted considerable attention from various researchers, and many interesting proposals may be found in the scientific literature. However, many of these studies focus on finding a set of interconnected tuples containing all or some of the query keywords. The architecture introduced by this paper covers from the selection of databases on the Web to ranked relevant results. The approach also includes important aspects such as the proximity between keywords, query segmentation, and the use of aggregate functions, among others. The empirical evaluation analyzes the relevance of results and proves competitive as regards related studies. © 2019 Elsevier Ltd. All rights reserved.

1. Introduction The use of keywords for retrieving information consists of a simple search technique. The last decade has witnessed the growing use of this technique, which has in fact become a standard for user interaction with the World Wide Web (WWW). However, it cannot be applied to all storage media, like relational databases, which is a major form of data storage. Querying in relational databases requires prior knowledge of storage structures and of the syntax of a structured language, such as SQL. However, the majority of users do not have such knowledge, which limits access to the stored data. In addition, the databases available on the Web are not known by users and the selection of relevant ones may occur before query processing itself. In the last few years, great efforts have been made in research and development activities to extend the abilities of keywordbased search to data sources that follow the relational paradigm. Relevant issues include:

• The semantic analysis has to take into account the interdependence between keywords. Even though a query is made up of a simple list of keywords, the meaning of each keyword is not independent from the meaning of the others; together, they represent the concepts intended by the user when creating a query. • An appropriate query segmentation considers keywords’ joint mapping instead of assuming that each keyword represents a single role in the database. Consider the schema ∗ Corresponding author. E-mail address: [email protected] (J.C. da Silva).

of the relation Employee (Id, Name, Address, Salary, Super_id, Department_id). A possible interpretation for the query ‘‘employee Houston Tx’’ is ‘‘employees who live at the address Houston Tx’’. Therefore, the keywords Houston and Tx are expected to be mapped together for the attribute Address of the table Employee, instead of being mapped separately for a database structure. • Some keywords play special meaning in the query, e.g. summarizing (aggregation) functions, superlatives, etc., so that they cannot be mapped out for a corresponding database structure. Referring once again to the relation Employee, the query ‘‘highest salary’’ expects a statistic, the highest salary, as a result, not a set of interconnected tuples containing the keywords highest and salary. This paper focuses on a semantic approach to keyword queries, and a new keyword-based search method for relational databases was defined. The aspects listed above are taken into account and appropriate solutions are provided and implemented. For a given keyword query, the proposed method identifies relevant databases on the Web and converts it into corresponding SQL queries, all of which are submitted to the underlying databases. An SQL query is a structured query in which relations, attributes, and their conditions are accurately specified, whereas a keyword query comprises imprecise terms that express the user’s need for information. Therefore, this study introduces semantics to this conversion process, to provide a clearer idea of the meaning intended by the query and to construct SQL expressions that represent the user’s real intent, returning results in order of relevance. The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 describes the architecture of the

https://doi.org/10.1016/j.is.2019.101460 0306-4379/© 2019 Elsevier Ltd. All rights reserved.

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

2

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx

proposed keyword query method, as well as the way it operates. Section 4 explains how the system works through a complete example. Section 5 provides an empirical evaluation of the proposed method and conclusions are presented in Section 6. 2. Related work This section focuses on relevant work to the approach discussed in this paper, mainly those that use words of a natural language to query and select databases. 2.1. Selection of relational databases Sayyadian et al. [1] state that the answer to a keyword-based query may not lie in a single database, but in multiple databases. Hence, it is necessary to combine tuples from several databases in order to obtain the desired response. They propose an algorithm which finds approximate joins between foreign keys from multiple heterogeneous databases. Hassan et al. [2] and Yu et al. [3] focus on the use of keyword queries on distributed and heterogeneous databases. In Hassan et al. [2], given a set of distributed databases and a query q, formed by keywords and logical operators, the proposed system identifies the databases that are most prone to return relevant results for query q and then searches only these databases. To assess the usefulness of each of them for query q, an inverted index is created, which stores database statistics and then uses them to represent how useful a database is for the query. Similarly, the purpose of the study by Yu et al. [3] is to identify relevant databases for a given keyword-based query. In Hassan et al. [2], however, the usefulness of a database is calculated only in terms of the frequency with which the keyword comes up in a database term, whereas Yu et al. [3] also take proximity into account. Silva et al. [4] have developed research focused on exposing relational databases on the Web. Each database has a special table which represents the extension of the database catalogue and contains information regarding database content. The present work selects relevant databases for a given query based on the method of exposing metadata from relational databases proposed in [4]. This information is here used to select the databases that are likely to return relevant results to the keyword query. 2.2. Keyword search in relational databases The literature reveals two main approaches to keyword queries in relational databases: one based on Candidate Networks and another based on Steiner Trees. Both conceive a database as a network of interconnected tuples and focus on detecting tuples that contain the keywords of a given query. Query processing returns connected components based on the way these tuples are associated. DBXplorer [5] and DISCOVER [6] implement the Candidate Networks approach, whereas BANKS [7] applies the Steiner Tree approach. The FRISK system [8] uses a dynamic programming algorithm to compute the query’s best segmentations and then present them to the user, who in turn chooses the one that better suits his/her intent. Keymantic [9] and Keyword++ [10] take into account the query’s ambiguity and address the completeness and accuracy of results. Completeness consists of returning all relevant results, whereas accuracy refers to the most relevant results which match user intent. A query may return more than one result, in which case it becomes necessary to order them according to their relevance. Ranking functions assign scores to each result and classify them based on their scores. Some researchers have employed Data

Retrieval metrics for calculating ranking functions, such as Luo et al. [11], Hristidis et al. [12] and Liu et al. [13]. In systems DISCOVER [6] and DBXplorer [5], results are ranked by simple methods, e.g. based on the number of joins. As for BANKS [7], calculating result scores takes into account the edge weight of the data graph. Other fields of research, such as those discussed by Li et al. [14] and Peng et al. [15], present feedback strategies that allow users dissatisfied with the set of initial results to perform further searches in an attempt to obtain better results. In the TASTIER system [16], the aim is to guide the user in query creation via self-complementation, whereas Koutrika et al. [17] seek to guide users and refine searches by using the abilities of summarizing and navigating data clouds. As regards the MeanKS System [18], the user’s interest is firstly captured by determining the roles of the keywords. Then, the system uses schema-based ranking to rank join trees that cover the keyword roles. Such ranking uses the relevance of relations and foreign-key relationships in the schema over the database’s information content. The DKS approach [19] copes with distributed keyword search processing on a cluster of servers over large databases in a parallel way. For that, the database is modeled as a connected and undirected data graph. Such a graph is then partitioned into multiple subgraphs without overlap and these are assigned to different servers. According to Bergamaschi et al. [20], keyword search in databases is becoming the technology of choice, even if it suffers from efficiency and effectiveness problems that prevent it from being adopted at Web scale. They emphasize the need for a reference architecture for keyword search in databases such as a companion evaluation framework. 2.3. Focus and contributions of this paper Bergamaschi et al. [21] contend that keyword search over relational databases is still a considerable challenge, despite the effort that the research community has put in the field in the last fifteen years and the significant number of scientific publications and prototypes developed. The authors discuss two main issues that have hampered the design and development of next generation systems for keyword search over structured data: (i) the lack of systemic approaches that account for all the issues of keyword search, from the interpretation of user needs to the computation, retrieval, ranking and presentation of the results; and (ii) the absence of a shared and complete evaluation methodology measuring user satisfaction, achieved utility and required effort for carrying out informative tasks. Not only do we endorse such an ideal scenario, we contend that keyword-based query approaches are evolving towards maturity, with a view to offering practical uses for the user community. This study focuses on the design and analysis of SQUIRREL (Semantic Keyword Query System for Information Retrieval from Relational Databases), whose contribution refers to an approach that ranges from the selection of databases on the Web that are relevant to user query to the presentation of a ranked list of results (see the architecture in Fig. 1). The innovative points also concern the following aspects, dealt with as a single approach:

• Database selection: It refers to the identification and selection of databases containing useful information for the query with keywords, by retrieving metadata exposed on the Web with respect to the content of relational databases. The method includes ranking such databases so that the most relevant will appear at the top of the list.

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx

3

Fig. 1. The architecture of SQUIRREL.

• Keyword query expansion by adding semantic-related words: Keywords are not always an exact representation of the terms in the database, so that words semantically related to the query keywords have to be considered. To this end, dictionaries (e.g. WordNet) are useful in providing synonyms, hyperonyms, hyponyms and other terms related to a given keyword. • Sorting and summarizing functions analysis: The identification and processing of keywords that suggest the use of these functions are carried out by applying an extensive list of reserved words. • Query processing considering the closeness between keywords: The relative positions of the words in the query based on their proximity impact query processing. The further away a mapped keyword is from an unmapped keyword, less influence it will exert on the mapping of that unmapped keyword. • Compound keyword analysis: Two or more consecutive keywords may be considered together to form the value of a single database attribute (i.e. multiple keywords that compose a single meaning), since they might represent a unique concept, and so they should be mapped jointly to the same attribute. 3. SQUIRREL – Semantic Keyword Query System Information Retrieval from Relational Databases

for

3.1. Architecture Fig. 1 shows an overview of the SQUIRREL architecture which is stated in four sequential phases (modules): query preprocessing, database selection, query mapping, and query processing. Each module has several components that execute system functions and services. Given a keyword-based query and databases available on the Web, SQUIRREL identifies the relevant databases for the query

and converts the keyword-based query into corresponding SQL queries, which are then submitted to the databases identified as relevant. In response to the keyword-based query, the system returns the results obtained by the execution of SQL queries. 3.2. Preprocessing This stage is responsible for identifying keywords in the query that do not provide a direct meaning in the database structure, but other forms of semantics, e.g. the use of aggregate functions and data sorting. In addition, this stage determines the query’s best segmentation regarding the values comprising more than one keyword. Some query keywords may not have a direct meaning in the database terms, such as relations, attributes, and attribute values. Let us take, for instance, the query ‘‘highest salary’’, mentioned earlier. The keyword salary could be mapped for the attribute Salary of the relation Employee, but the keyword highest has no representative element in the database for which it could be mapped. The best interpretation for this particular query is that the keyword highest represents the need to use the aggregate function max. Once this need has been identified, it is necessary to remove these keywords from the query so as to leave only the ones that have an actual meaning in the database and that will be mapped during the query processing stage. Yet another factor considered during preprocessing is query segmentation. Keywords which can be mapped for attribute values, i.e. for data must be given special attention during this stage. If the query is segmented to the extent that each keyword represents a given term of the database for which it will be mapped, values such as a product’s address and description may lose their meaning. Having said that, the values that may comprise more than one keyword must be taken into account during query segmentation, so that they may be mapped together.

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

4

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx

3.2.1. Summarize function/sorting This component allows the identification of keywords which represent the intention of using aggregate functions and sorting. Identifying keywords which suggest the use of aggregate functions and sorting is performed through a list of reserved words, which are basically predefined words based on the way users produce their queries. Seven groups were created and a set of keywords was defined for each of them. Each set consists of synonyms obtained by integrating a thesaurus as a semantic resource. Listed below are the groups and their respective sets:

• maximum = {highest, maximum, maximal, largest, greatest, • • • • • •

max} minimum = {lowest, minimal, minimum, smallest, least, shortest, min} mean = {average, mean, avg} sum = {total, sum} count = {quantity, amount, count} grouping = {for each, for all, grouped by} order = {order by, descending order by, ascending order by, sorted by, ranked by, classified by, organized by}1

After identifying which keyword suggests the use of an aggregate function or sorting, it is necessary to establish which attribute the function will be applied on. Based on the fact that users create queries in which related words are close to one another, once a keyword that suggests the use of an aggregate function or sorting is identified, the function it suggests is applied to the term represented by the keyword that immediately follows it. For instance, in the query ‘‘highest salary’’, the keyword highest is identified in the list of reserved words; this list shows that the keyword belongs to the predefined group of words that represent the aggregate function max. Because the keyword salary stands next to and follows the keyword highest, it will be the keyword on which the max function will be applied. Consider, for instance, that the keyword salary is mapped for the attribute Salary of the relation Employee. The query ‘‘highest salary’’ may then be converted into the SQL query select max(Salary) from Employee. Identifying the use of sorting follows the same line of reasoning. 3.2.2. Query segmentation In general, search engines support the use of delimiters to group multiple words to a single concept. Many concepts are represented by a phrase rather than by a single word. Google and Yahoo! are examples of search engines that allow syntax for phrase searching. Identifying keywords which must be regarded collectively to form an attribute value is performed by the use of single inverted commas. For instance, in the query ‘‘employee Houston TX’’, the keywords Houston and TX must appear within single inverted commas to represent their need to be mapped collectively for the same attribute domain. Therefore, the query must be ‘‘ employee ‘Houston TX’ ’’. 3.2.3. Query expansion Keywords are not always an exact representation of the word contained in the database. To deal with this problem, it is necessary to consider other words related to the keywords. Therefore, the WordNet dictionary [22], which provides synonyms, hypernyms, hyponyms, and other words related to a given keyword, was used in this study. 1 This group requires additional verification if the query contains the keyword descending, then the clause desc is used.

Table 1 Attributes of the Exposition Metadata Table (EMT) [4]. serial oai_set dc_subject dc_date dc_source dc_rights

provider dispquery dc_description dc_type dc_language loginbd

url dc_title dc_contributor dc_format dc_relation passwordbd

email dc_creator dc_publisher dc_identifier dc_coverage –

3.3. Database selection Information on the Web is available in various storage media, and such heterogeneity of digital repositories makes it difficult to retrieve information due to source dispersal, divergences in search interfaces, and lack of content integration, among other factors. 3.3.1. Exposing databases on the Web The Open Archives Initiative (OAI) [23] was created to develop and promote interoperability standards between digital repositories. For that, OAI created the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which enables publicizing and sharing digital collections to facilitate effective content spread among these repositories. Publicizing and sharing digital collections are mechanisms widely used by digital libraries and websites. Nevertheless, publicizing and sharing database metadata have not yet been fully explored. To enable the search for information in relational databases on the Web, Silva et al. [4] propose a mechanism which allows homogeneous querying in digital libraries and databases by establishing a common set of metadata which are used to export (make available) these sources of information. To allow publicizing and sharing digital resources, OAI-PMH requires a standard metadata format. The standard used in digital libraries is Dublin Core [24]. In [4] the authors investigated the Dublin Core elements used to represent metadata for digital libraries, with a view to 1) defining a standard set of metadata to describe a database, in order to promote interoperability between databases and other information sources, such as digital libraries; and 2) allowing publicizing (making available) the content of these databases by means of a common protocol. Fifteen elements were preliminarily identified and 8 other elements were required for describing the metadata of relational databases, thus totaling 23 elements; these are summarized in Table 1. Element names beginning with dc_ represent elements which belong to the Dublin Core standard and are used in the context of digital libraries. Elements which do not begin with dc_ represent the additional elements proposed in this work to enable database description. Given the fact that these metadata are not available in the database catalog, the mechanism proposed by Silva et al. [4] suggests that they are supplied by the database administrator and stored in a table, which represents a catalog extension. This table is entitled Exposition Metadata Table (EMT) and contains an attribute for each of the elements described in Table 1. Once this table is defined, metadata are extracted and exported by OAI-PMH, hence allowing the identification and localization of databases available on the Web. With this mechanism, it is possible to query data from these information sources. 3.3.2. Database relevance analysis In accordance with the method of exposure of metadata from relational databases, proposed by Silva et al. [4], each database has an EMT which represents the extension of the database catalog and contains information regarding database content.

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx

The identification and selection of databases which have relevant content for keyword-based queries are performed according to the analysis of information contained in the EMT. Not all elements in this table contain information regarding database content. Some of them, e.g. login, url, and dispquery, refer to location and access. The identification stage considers only elements which contain or suggest information regarding database content. The elements considered are: dc_title, dc_subject, dc_description, and dc_identifier. If information concerning one of these four elements contains any of the query keywords, the database is considered relevant for the query. 3.3.3. Database ranking Given the set of databases that are relevant for the query, these are ranked so that the most relevant database tops the list. The relevance of a database in view of a query is computed according to the number of times the query keywords appear in the information of the four elements considered in the previous stage. 3.4. Query mapping Query mapping is carried out by identifying and analyzing potential associations between query keywords and database terms. The process requires understanding the meaning of keywords, both separately and jointly, and the construction of a database structured query (e.g. an SQL expression) that provides a coherent interpretation of the original query. Thus, it is necessary to map the keywords for the database structures, e.g. relations, attributes, and attribute values. Several techniques and tools have been proposed to solve the problem of keyword queries in relational databases, as was pointed out in Section 2. However, most of these proposals fail to consider the many interpretations a keyword query may pose. 3.4.1. Keymantic core The mapping process explores the relative positions of query keywords together with external sources, to produce a more accurate assumption of the semantics represented by the query. This statement is grounded on the fact that there may be interdependencies between query keywords. In other words, the meaning of each keyword is not independent from the meaning of the others; all of them collectively represent the concepts the user had in mind when creating the query. Moreover, not all keywords represent instance values. Many are used as metadata of adjacent keywords. The mapping process proposed in this paper is based on Keymantic [9], which uses a weight matrix consisting of a twodimensional array with one row for each keyword in the query and one column for each database term. The value of a cell represents the weight related to the mapping performed between keyword and database term. Two sub-matrices may be distinguished in the weight matrix. The first, called SW , corresponds to the database terms related to schema elements, i.e. relations and attributes. The second, called VW , corresponds to attribute values, i.e. elements which belong to attribute domains. The database terms related to schema elements are referred to as schema terms, whereas those related to attribute domains are referred to as value terms. Five steps are carried out in the mapping process: Step 1 Intrinsic Weight Computation of Schema and Value Database Terms: to compute intrinsic weights, it is necessary to calculate the relevance between each query keyword and each database term. The computation of schema terms is performed by exploring and combining a number of similarity techniques based on structural and lexical knowledge

5

extracted from the database, as well as on external knowledge, such as ontologies, thesauri, etc. Intrinsic value weights exploit domain information and represent the decision on whether a keyword belongs to the domain of an attribute or not. Step 2 Selection of Best Mappings for Schema Terms: a series of mappings are generated, based on the intrinsic weights of sub-matrix SW . Each mapping associates a certain number of keywords to the database schema terms. Keywords that remain unmapped are considered at a later stage during value term mapping. Only mappings that reach the highest score are selected. Step 3 Contextualization of VW and Selection of Best Mappings for Value Terms: based on each partial mapping generated in the previous stage, the unmapped keywords are now mapped. This mapping is carried out in two phases. In the first phase, each partial mapping generates an updated sub-matrix VW , so that the intrinsic weights of sub-matrix VW that were generated in Step 1 are updated to show the interdependence between unmapped keywords and the keywords already mapped for the schema terms. In the second phase, given an updated sub-matrix VW , mappings for the most prominent value terms of the keywords that remain unmapped are generated. Step 4 Generation of Configurations: a total mapping of keywords for the database terms is carried out, forming a configuration. The configuration’s score is the sum of weights in the weight matrix of elements [i, j], where i is a keyword and j is the database term to which the keyword was mapped. Step 5 Generation of Interpretations: once the best configurations have been computed, the interpretations of the keyword query, i.e. SQL queries, may be generated. The score of each SQL query is that of its respective configuration. A configuration is simply a mapping of the keywords to database terms. The presence of different join paths between these terms leads to multiple interpretations. 3.4.2. Synonym-based enriching Fig. 2 shows the algorithms related to the intrinsic weight computation of schema database terms. Fig. 2(a) presents the algorithm run for Keymantic and Fig. 2(b) reveals the code of the algorithm proposed in this study. During the intrinsic weight computation of schema database terms, the original algorithm employs a series of techniques that measure similarity between keywords and database terms, selecting the one which produces the best result. One of these techniques is string similarity, which uses a number of different similarity metrics. It also assesses the relationship between a given keyword and a database schema term based on their semantic relationship, and for that uses ontologies and dictionaries (Row 3 in Fig. 2(a)). For each measuring technique (string similarity, semantic relationship, etc.), the similarity between a keyword and a database term is calculated in a 0–1 interval. The highest value returned is then multiplied by 100 and selected as intrinsic weight (Row 16 in Fig. 2(a)). If none of the values returned is higher than a predefined threshold, then the weight is set to 0 (Rows 12–14 in Fig. 2(a)). The method here proposed initially considers only the string similarity technique. Three similarity metrics are used for this technique: Jaccard, Levenshtein, and Cosine (Row 2 in Fig. 2(b)). Each metric has two strings as input and returns the similarity between them in a 0–1 interval. Computing the intrinsic weight of a keyword and a database term is based on the average of similarities returned for each metric (Rows 7–10 in Fig. 2(b)). If the average is greater than the threshold, it is multiplied by 100 and selected as intrinsic weight (Rows 30–33 in Fig. 2(b)). If

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

6

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx

Fig. 2. Algorithm for intrinsic weight computation of schema database terms.

not, the similarity between the synonyms of the keyword and the database term will be calculated (Rows 11–19 in Fig. 2(b)). That involves using the WordNet2 dictionary to return the synonyms of each query keyword. The average of similarities returned for each metric is computed for each synonym found. The greater average value returned will be taken into account. If this value is greater than the threshold, the average is multiplied by 90 and selected as intrinsic weight (Row 26–29 in Fig. 2(b)). If it is lower than the threshold, then the intrinsic weight is set to 0 (Rows 24–25 in Fig. 2(b)). Since users normally expect keywords from the query to appear in the results, the reason for considering the use of a thesaurus at a later stage is that it is necessary to give greater relevance to keyword queries. In other words, the existence of the keyword in the database must have greater weight than the existence of its synonym. That is why the average value of similarities between synonyms is multiplied by 90, whereas the average value of query keywords is multiplied by 100 (Rows 28 and 32 in Fig. 2(b)).

2 http://wordnet.princeton.edu/.

3.4.3. Weight normalization The modification applied to the step Intrinsic Weight Computation of Value Database Terms is performed because computing value term weights is regarded in binary form. Therefore, whereas the weights of sub-matrix SW – which refer to schema terms – are found in a 0–100 interval, the weights of sub-matrix VW are 0 or 1. Thus, given that the weights of matrix VW are binary, the value attributed to the total configuration score by the keywords mapped to value terms is minimum. That being said, it is necessary to compare the value of value term intrinsic weights with that of schema term weights. Hence, in SQUIRREL, after calculating the intrinsic weights for the value terms, weights that have value 1 are multiplied by a predefined constant. 3.4.4. Closeness-based analysis Contextualization takes into account the interdependence between query keywords, in which mapping a given keyword may increase or reduce the probability that another keyword, as yet unmapped, corresponds to a given database term. In addition to considering the interdependencies between keywords, SQUIRREL takes into account their relative positions in the query based on their proximity. The farther a mapped keyword is from an unmapped one, the less influence it will exert over this word’s

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx Table 2 Partial EMT for Company database. Element

Information

serial url dispquery dc_title dc_creator dc_subject

1 https://gist.github.com/onzfonz/8b43f864bcc871dea124 yes Company Elmasri e Navathe Department. Dependent. Location. Employee. Project. Management. Salary. Working hours. Address. dc_descriptionRecords the staff of the company, which department they belong to, which department they manage, which project they are working on, the location of the department and project, employees’ dependents. dc_type Database - MYSQL dc_identifier bdcompany

7

Table 3 Partial EMT for Mondial database. Element

Information

serial url dispquery dc_title dc_creator dc_subject

2 https://dev.mysql.com/doc/sakila/en/ yes Sakila Oracle Actor. Address. Category. City. Country. Film. Language. Sale. Store. District. Phone. Postal code. Customer. Title. Description. Release year. Rental. Price. Manager. dc_description Database of a fictional video rental store containing movies, actors, customers, locations, etc. dc_type Database - MYSQL dc_identifier bdsakila

Table 4 Partial intrinsic weight SW matrix.

mapping process. This fact led to the following modification during contextualization: instead of adding a constant to intrinsic weights, the value added to the weight is proportional to the distance between keywords. 3.5. Query processing Query processing module receives as input a set of SQL queries grouped by mapping configuration, each of them with a corresponding configuration mapping score. 3.5.1. Query execution Each SQL query generated in the previous stage is now run in the corresponding database, and the results of these queries represent the results expected for a keyword query. 3.5.2. Ranking of results The results are not equally significant, as some of them represent the semantics intended by the query more effectively. In this regard, it is interesting to generate a ranking of results. Considering that the system computes a score for each SQL query generated, the results returned to the user may then be listed firstly by the position of their corresponding mapping configuration score, and secondly by the size of the join path, based on the number of joins required to create the SQL expression.

employees department name

D

E

D.Dna

D.Dnu

E.Fna

E.Min

E.Lna

P.Pna

0 100 0

89 0 0

0 0 83

0 0 0

0 0 83

0 0 0

0 0 83

0 0 83

Table 5 Partial intrinsic weight VW matrix. employees department name

D.Dna

D.Dnu

E.Fna

E.Min

E.Lna

P.Pna

65 65 65

0 0 0

65 65 65

65 65 65

65 65 65

65 65 65

Table 6 First best mapping. employees department name

D

E

D.Dna

D.Dnu

E.Fna

E.Min

E.Lna

P.Pna

0 100* 0

89* 0 0

0 0 83*

0 0 0

0 0 83

0 0 0

0 0 83

0 0 83

Table 7 First mapping generated from the first best mapping. employees department name

D

E

D.Dna

D.Dnu

E.Fna

E.Min

E.Lna

P.Pna

0 100* 0

−100

0 0 83*

0 0 0

0 0 83

0 0 0

0 0 83

0 0 83

0 0

4. A complete example Consider the database illustrated in Fig. 3, containing information about a company’s employees, departments and projects. Also consider the keyword query ‘‘quantity employees for each department order by name’’. During the preprocessing stage, the summarize function/ sorting component analyzes the keywords quantity, for each and order by and suggests the use of terms count, group by and order by, respectively. Such terms will not be mapped out for a corresponding database structure and therefore are deleted from the query, leaving only ‘‘employees department name’’. The query does not have single inverted commas, so the next component to be performed is query expansion. The query resulting from such component is ‘‘employees employee department section name figure gens’’. In the database selection phase, consider the existence of two databases, whose partial EMTs are presented in Tables 2 and 3. The Company database has a score of six, while the Sakila has a zero score. Therefore, only the Company database is relevant to the keyword query. Tables 4 and 5 show the intrinsic weights for SW (schema weight) and VW (value weight) matrices, respectively, generated in the query mapping stage. These numbers indicate whether

each query keyword can probably be mapped to a database schema term or considered as database values. Due to space constraints, not all relations and attributes are presented, only those that are necessary in the context of query mapping. Instead of full names, we use the first letter of table names and the first three letters of attribute names. Table 6 presents the first best mapping, whose score is 272 (sum of values indicated by *’s), i.e. the keywords employee, department and name are mapped, respectively, to relations Employee and Department, and to attribute Dname of Department. The algorithm does not stop after the generation of the first best mapping, but continues generating the second, the third, etc. A negative number is assigned to the individual mappings in order not to repeat them. Thus, from the first best mapping three additional mappings are generated, one for each mapped query keyword, as presented in Tables 7–9. Moreover, the algorithm only selects mappings whose score is greater than a certain threshold. The higher its value, the less mappings are generated at the end, but with higher confidence. In contrast, the lower the threshold value, the more mappings with lower confidence are generated. Considering a threshold of 90% for the first best mapping score, mappings in Tables 7 and 8 are rejected.

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

8

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx

Fig. 3. Fraction of a database schema with sample data. Table 8 Second mapping generated from the first best mapping. employees department name

D

E

D.Dna

D.Dnu

E.Fna

E.Min

E.Lna

P.Pna

0

89* 0 0

0 0 83*

0 0 0

0 0 83

0 0 0

0 0 83

0 0 83

−100 0

Table 9 Third mapping generated from the first best mapping. employees department name

D

E

D.Dna

D.Dnu

E.Fna

E.Min

E.Lna

P.Pna

0 100* 0

89* 0 0

0 0

0 0 0

0 0 83*

0 0 0

0 0 83

0 0 83

−100

Table 10 Mapping generated from mapping of Table 9. employees department name

D

E

D.Dna

D.Dnu

E.Fna

E.Min

E.Lna

P.Pna

0 100* 0

89* 0 0

0 0

0 0 0

0 0

0 0 0

0 0 83*

0 0 83

−100

−100

Table 11 Mapping generated from mapping of Table 10. employees department name

D

E

D.Dna

D.Dnu

E.Fna

E.Min

E.Lna

P.Pna

0 100* 0

89* 0 0

0 0

0 0 0

0 0

0 0 0

0 0

0 0 83*

−100

−100

−100

Similarly, new mappings can be generated from the mapping of Table 9. Those which have a score greater than the threshold will be considered, and the process repeats itself until it is not possible to generate mappings with a score above the threshold. As a result, four mappings (Tables 6, 9, 10 and 11) are selected and each is considered as a configuration, since all the keywords were mapped. If some keywords had not been mapped, it would be necessary to map them to value terms. Interpretations are generated from each configuration by considering the existence of different join paths between the mapped

relations. For configurations indicated in Tables 6, 9 and 10, there are three join paths between Employee and Department: two direct paths through the pair < primary key, foreign key > (< Dno, Dnumber > and < Ssn, Mgr_ssn >) and the path formed by the join Department-Project-Works_on-Employee. Thus, for each configuration, three SQL queries will be generated. For example, the three SQL queries (interpretations) obtained from the mapping of Table 6 are:

• select distinct count(employee.ssn), department.dname from department, employee where employee.ssn = department.mgr_ssn group by department.dnumber order by department.dname • select distinct count(employee.ssn), department.dname from department, employee where department.dnumber = employee.dno group by department.dnumber order by department.dname • select distinct count(employee.ssn), department.dname from department, employee, project, works_on where department.dnumber = project.dnum AND project.pnumber = works_on.pno AND employee.ssn = works_on.essn group by department.dnumber order by department.dname In Table 11, the keywords employees and department are mapped, respectively, to relations Employee and Department, and the keyword name is mapped to attribute Pname of Project, indicating the need to use data stored in relations Employee, Department and Project. There are four possible join paths between these relations, thus generating four interpretations:

• select distinct count(employee.ssn), project.pname from department, employee, project where project.dnum = department.dnumber AND employee.ssn = department.mgr_ssn group by department.dnumber order by project.pname

• select distinct count(employee.ssn), project.pname from department, employee, project where project.dnum = department.dnumber AND employee.dno = department.dnumber group by department.dnumber order by project.pname

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx

9

• select distinct count(employee.ssn), project.pname from department, employee, project, works_on where project. pnumber = works_on.pno AND employee.ssn = works_on. essn AND employee.ssn = department.mgr_ssn group by department.dnumber order by project.pname • select distinct count(employee.ssn), project.pname from department, employee, project, works_on where project. pnumber = works_on.pno AND employee.ssn = works_on. essn AND employee.dno = department.dnumber group by department.dnumber order by project.pname Once the SQL queries are generated, they are executed in the corresponding database and the results are sorted and returned to the user. Results are sorted primarily by the score of the corresponding interpretation and secondarily by the size of the join path. As all the interpretations considered in this example have the same score (272), results will be sorted by the size of the join path. Therefore, the results of the interpretations that have the smallest join path will appear first to the user. 5. Approach assessment As a baseline for the empirical assessment of SQUIRREL, we considered the implementation of Keymantic core component currently available, called Keymantic2.jar.3 However, we have verified that such implementation has functional differences in relation to the original Keymantic method as described in [25]. A configuration is simply a mapping of the keywords to database terms, and the presence of different join paths among these terms results in multiple interpretations. The main difference refers to the set of interpretations generated for each configuration. The Keymantic2.jar simply does not explore all the interpretation domains, as it creates a single interpretation for each configuration, and in most queries such interpretation is not the most relevant one in terms of user intent. For this reason, we implemented the Keymantic method based on information provided by the paper that introduces the tool [25]. The Company database was used in the empirical analysis, whose schema size is 208 KB. Table 12 shows the set of keyword queries selected from a query pool and its user intended semantics. Fig. 4 compares and contrasts the number of configurations and interpretations (SQL queries) generated by both systems. Values are shown for each of the queries, whose identification is the same as that used in Table 12. Internal modifications, performed during the first three stages of Keymantic’s mapping process, influenced the results obtained by our method, leading to lower values in comparison with Keymantic as regards the number of configurations and interpretations. This reduction has a major impact on the cost of the process and allows the user to deal with less results for his/her query. The graphs in Fig. 5 explore queries with three or more keywords, as well as present each system’s relative accuracy. This metric reveals the relevant fraction of the interpretations obtained, hence its status as a quality indicator commonly described in the literature. As shown in Fig. 5(a), SQUIRREL was more effective in all queries. Fig. 5(b) exhibits accuracy values which refer to query size, i.e. the number of keywords comprising the query. Accuracy decreases in both systems as the number of query keywords increases. However, results show, once more, better accuracy values when compared with Keymantic. As regards the external stage Ranking of Results, Keymantic is known to show results according to the order in which they are 3 Available in https://sourceforge.net/projects/keymantic/files/.

Fig. 4. Number of configurations and interpretations generated by both systems.

generated, with no guarantee that they will be presented in a descending score order. The metric Mean Reciprocal Rank (MRR) was used to assess the stage Ranking of Results. This metric measures how near the top the first relevant result ranks within the set of results. Fig. 6 shows the MRR for both systems based on query length. In summary, SQUIRREL shows satisfactory results using MRR. In Keymantic, the longer the query, the farther from the top of the ranking the first relevant result appears. This also occurs in SQUIRREL, but not as radically; in other words, even though the first result distances itself from the top as the query becomes longer, it remains one of the top-ranking results. 5.1. Threats to validity Three limitations were identified in the results obtained with the proposed method. The first limitation is related to the cardinality of the set of queries from which the results were obtained. A more extensive set of queries could lead to a higher confidence level when comparing results with Keymantic. The second limitation is related to how the queries were obtained, which are designed for the analysis of the proposed method instead of queries obtained from real systems. The last limitation concerns the fact that the proposed method was compared solely with Keymantic. A stronger argument could be made in favor of the proposed approach if its performance were to be compared with other existing stateof-the-art tools that allow searching relational databases using keyword queries.

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

10

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx Table 12 Set of keyword queries. Keyword queries

Intended semantics

1

department

List of information regarding the company’s departments

2

department employee

List of each department’s employees

3

department project

List of each department’s projects

4

department project newbenefits

Information of the department responsible for the Newbenefits project

5

employee dependent daughter

Information regarding employees who have a daughter as a dependent

6

project name employee John

Name of the project for which employee John works

7

employee address project 1

Address of employee who works in project 1

8

hours employee works project productz

Number of working hours of each employee in project ProductZ

9

project location employee salary 30 000

Location of project for which the employee who earns a 30 000 salary works

Fig. 6. MRR for both systems (inspired in [26]).

Fig. 5. Accuracy metric for both systems.

6. Conclusion This paper proposed a method for querying relational databases with keywords to simplify access to these data, given the fact that such queries use natural-language words. This method considered factors such as query segmentation, aggregate functions and sorting, as well as the user’s intended semantics when creating the query.

In order to offer new querying possibilities, SQUIRREL implements functionalities that enable the specification of queries such as those that employed aggregate functions, sorting, and values comprising more than one keyword. Moreover, during the experiments, SQUIRREL produced a smaller and more significant set of results for the keyword queries submitted when compared with other existing state-of-the-art tool that allow searching relational databases using keyword queries. Along the course of this research, we identified some aspects that could be complemented on to further discussion on this topic. Among them is the use of ontologies. During the process of intrinsic weight computation of schema terms, ontologies can be used in addition to the synonyms obtained by WordNet, hence adding greater semantics to these stages. Another aspect to be considered is more effective query segmentation. In the proposed method, composite values are identified via the use of single inverted commas, which requires the user’s a priori knowledge to construct the query in a suitable way. Lastly, it would be interesting to extend the ideas introduced by this study to relational database queries via natural language, which may be regarded as a broader keyword query. Acknowledgments The author João Carlos da Silva is grateful to the support given by CNPq, Brazil (Science Without Borders program) through an individual scholarship.

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.

M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior / Information Systems xxx (xxxx) xxx

References [1] M. Sayyadian, H. Lekhac, A. Doan, L. Gravano, Efficient keyword search across heterogeneous relational databases, in: 2007 IEEE 23rd International Conference on Data Engineering, IEEE Computer Society, Los Alamitos, CA, USA, 2007, pp. 346–355, http://dx.doi.org/10.1109/ICDE.2007.367880. [2] M. Hassan, R. Alhajj, M.J. Ridley, K. Barker, Simplified access to structured databases by adapting keyword search and database selection, in: Proceedings of the 2004 ACM Symposium on Applied Computing, SAC ’04, ACM, New York, NY, USA, 2004, pp. 674–678, http://dx.doi.org/10.1145/967900. 968042. [3] B. Yu, G. Li, K. Sollins, A.K.H. Tung, Effective keyword-based selection of relational databases, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD ’07, ACM, New York, NY, USA, 2007, pp. 139–150, http://dx.doi.org/10.1145/1247480.1247498. [4] J. Da Silva, E. Kowata, A. Vincenzi, Extracting and exposing relational database metadata on the web, in: Proceedings of the IADIS International Conference WWW/Internet, ICWI 2012, 2012, pp. 35–42. [5] S. Agrawal, S. Chaudhuri, G. Das, Dbxplorer: A system for keyword-based search over relational databases, in: Proceedings of the 18th International Conference on Data Engineering, 2002, pp. 5–16, http://dx.doi.org/10.1109/ ICDE.2002.994693. [6] V. Hristidis, Y. Papakonstantinou, Discover: Keyword search in relational databases, in: P.A. Bernstein, Y.E. Ioannidis, R. Ramakrishnan, D. Papadias (Eds.), VLDB ’02: Proceedings of the 28th International Conference on Very Large Databases, Morgan Kaufmann, San Francisco, 2002, pp. 670–681, http://dx.doi.org/10.1016/B978-155860869-6/50065-2. [7] B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe Parag, S. Sudarshanxe, BANKS: Browsing and keyword searching in relational databases, in: P.A. Bernstein, Y.E. Ioannidis, R. Ramakrishnan, D. Papadias (Eds.), VLDB ’02: Proceedings of the 28th International Conference on Very Large Databases, Morgan Kaufmann, San Francisco, 2002, pp. 1083–1086, http: //dx.doi.org/10.1016/B978-155860869-6/50114-1. [8] K. Pu, X. Yu, Frisk: Keyword query cleaning and processing in action, in: IEEE 25th International Conference on Data Engineering (ICDE), 2009, pp. 1531–1534, http://dx.doi.org/10.1109/ICDE.2009.139. [9] S. Bergamaschi, E. Domnori, F. Guerra, M. Orsini, R.T. Lado, Y. Velegrakis, Keymantic: semantic keyword-based searching in data integration systems, Proc. VLDB Endow. 3 (2010) 1637–1640, http://dx.doi.org/10.14778/ 1920841.1921059. [10] V. Ganti, Y. He, D. Xin, Keyword++: a framework to improve keyword search over entity databases, Proc. VLDB Endow. 3 (2010) 711–722, http: //dx.doi.org/10.14778/1920841.1920932. [11] Y. Luo, W. Wang, X. Lin, Spark: A keyword search engine on relational databases, in: IEEE 24th International Conference on Data Engineering (ICDE), 2008, pp. 1552–1555, http://dx.doi.org/10.1109/ICDE.2008.4497619. [12] V. Hristidis, L. Gravano, Y. Papakonstantinou, Efficient ir-style keyword search over relational databases, in: Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’2003, VLDB Endowment, 2003, pp. 850–861, http://dx.doi.org/10.1016/B978-0127224428/50080-X. [13] F. Liu, C. Yu, W. Meng, A. Chowdhury, Effective keyword search in relational databases, in: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD ’06, ACM, New York, NY, USA, 2006, pp. 563–574, http://dx.doi.org/10.1145/1142473.1142536.

11

[14] F.-Z. Li, D. Luo, D. Xie, Fuzzy Search on non-numeric attributes of keyword query over relational databases, in: Computer Science Education, 2009. ICCSE ’09. 4th International Conference on, 2009, pp. 811–814, http://dx. doi.org/10.1109/ICCSE.2009.5228162. [15] Z. Peng, J. Zhang, S. Wang, C. Wang, Bring user feedback into keyword search over databases, in: Sixth Web Information Systems and Applications Conference (WISA), 2009, pp. 210–214, http://dx.doi.org/10.1109/WISA. 2009.29. [16] G. Li, S. Ji, C. Li, J. Feng, Efficient type-ahead search on relational data: a TASTIER approach, in: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, ACM, New York, NY, USA, 2009, pp. 695–706, http://dx.doi.org/10.1145/1559845.1559918. [17] G. Koutrika, Z.M. Zadeh, H. Garcia-Molina, Data clouds: summarizing keyword search results over structured data, in: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’09, ACM, New York, NY, USA, 2009, pp. 391–402, http://dx.doi.org/10.1145/1516360.1516406. [18] M. Kargar, A. An, N. Cercone, P. Godfrey, J. Szlichta, X. Yu, Meanks: Meaningful keyword search in relational databases with complex schema, in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, ACM, New York, NY, USA, 2014, pp. 905–908, http://dx.doi.org/10.1145/2588555.2594533. [19] Z. Yu, X. Yu, Y. Chen, K. Ma, Distributed top-k keyword search over very large databases with mapreduce, in: 2016 IEEE International Congress on Big Data (BigData Congress), 2016, pp. 349–352, http://dx.doi.org/10.1109/ BigDataCongress.2016.55. [20] S. Bergamaschi, N. Ferro, F. Guerra, G. Silvello, Keyword-based search over databases: A roadmap for a reference architecture paired with an evaluation framework, in: Transactions on Computational Collective Intelligence XXI - Volume 9630, Springer-Verlag New York, Inc., New York, NY, USA, 2016, pp. 1–20, http://dx.doi.org/10.1007/978-3-662-49521-6_1. [21] S. Bergamaschi, D. Beneventano, F. Mandreoli, R. Martoglia, F. Guerra, M. Orsini, L. Po, M. Vincini, G. Simonini, S. Zhu, L. Gagliardelli, L. Magnotta, From data integration to big data integration, in: A Comprehensive Guide Through the Italian Database Research over the Last 25 Years, Springer International Publishing, 2018, pp. 43–59, http://dx.doi.org/10.1007/9783-319-61893-7_3. [22] C. Fellbaum, WordNet: An Electronic Lexical Database, Bradford Books, 1998. [23] A.W.M. Foundation, Open archives initiative, 2018, http://www. openarchives.org/. (Accessed 16 November 2018). [24] A.W.M. Foundation, Dublin core metadata initiative, 2018, http:// dublincore.org/. (Accessed 16 November 2018). [25] S. Bergamaschi, E. Domnori, F. Guerra, R. Trillo Lado, Y. Velegrakis, Keyword search over relational databases: a metadata approach, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, ACM, New York, NY, USA, 2011, pp. 565–576, http: //dx.doi.org/10.1145/1989323.1989383. [26] S. Fakhraee, F. Fotouhi, Dbsemsxplorer: semantic-based keyword search system over relational databases for knowledge discovery, in: Proceedings of the Third International Workshop on Keyword Search on Structured Data, KEYS ’12, ACM, New York, NY, USA, 2012, pp. 54–62, http://dx.doi. org/10.1145/2254736.2254748.

Please cite this article as: M.S. Ramada, J.C. da Silva and P. de Sá Leitão-Júnior, From keywords to relational database content: A semantic mapping method, Information Systems (2019) 101460, https://doi.org/10.1016/j.is.2019.101460.