Fuzzy Quality-Aware Queries to Graph Databases


Journal Pre-proof

Olivier Pivert, Étienne Scholly, Grégory Smits, Virginie Thion

PII: S0020-0255(20)30110-9
DOI: https://doi.org/10.1016/j.ins.2020.02.035
Reference: INS 15219

To appear in: Information Sciences

Received date: 28 May 2019
Revised date: 20 December 2019
Accepted date: 9 February 2020

Please cite this article as: Olivier Pivert, Étienne Scholly, Grégory Smits, Virginie Thion, Fuzzy Quality-Aware Queries to Graph Databases, Information Sciences (2020), doi: https://doi.org/10.1016/j.ins.2020.02.035

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier Inc.

Olivier Pivert (a), Étienne Scholly (b), Grégory Smits (a), Virginie Thion (a,∗)

(a) Univ Rennes, IRISA - UMR 6074, F-22305 Lannion, France, {olivier.pivert,gregory.smits,virginie.thion}@irisa.fr
(b) Université de Lyon, Lyon 2, ERIC EA 3083, France, [email protected]

Abstract

Graph databases have attracted considerable interest in recent years because of their wide range of potential applications (e.g., social networks, biomedical networks, data stemming from the web). However, much published data suffers from quality problems, and graph data are no exception. In this paper, we investigate how to deal with quality information in graph databases at query time. We provide a framework that makes it possible to introduce fuzzy quality preferences into graph pattern queries. The question is addressed first from a theoretical point of view and then through an application to the Neo4j database management system, by means of an extension of the Cypher query language whose implementation issues are discussed.

Keywords: Quality, Graph database, Flexible querying, Fuzzy logic

∗ Corresponding author. Email address: [email protected] (Virginie Thion)

Preprint submitted to Information Sciences, February 10, 2020

Introduction

Much published data suffers from quality issues [1]. Some data may be incomplete, obsolete or inaccurate. It is now well recognized that data quality problems are endemic and may lead to severe consequences. Managing the quality of data is therefore a necessary condition for the suitability of most information systems [2].

A lot of work has been done on data quality management in relational databases (see e.g. [3] and [4]). However, even though relational databases are still widely used, the need to handle complex data has led to the emergence of other types of data models. In the last few years, graph databases have started to attract a lot of attention in the database world (see e.g. [5], [6], [7], [8] and [9]). Their basic purpose is to manage networks of entities, the underlying data model of many open data applications such as social networks and biological or bibliographic databases. The graph data model raises new challenges in terms of data quality management.

The literature proposes a wide range of metrics that make it possible to measure data quality for existing data models, including graph-based ones (see e.g. [3] and [1]). These metrics are used to detect quality problems in data and to measure their level of quality. But, assuming that quality metrics are available, questions still arise: “How to formalize and link these quality measurements within the data model?” and “How to take quality information into account at query time?”.

These are the two problems addressed in this paper, for which novel contributions are given. The framework first introduces a formalism to integrate quality measurements into an attributed graph data model. Graph pattern matching, which is the central theoretical notion of graph querying, is then extended to allow for the definition of fuzzy quality preferences in queries. A concrete implementation of this framework is then described to illustrate the approach. This implementation relies on an extension of the Cypher query language [10], i.e., the query language used by the Neo4j graph database management system [11].

The remainder of the paper is organized as follows. Section 1 reviews related work. Section 2 presents some background notions about graph databases and the way such databases can be queried (in particular by means of fuzzy preference queries). Section 3 defines a generalized graph data model, which extends the classical attributed graph data model so as to embed quality information. Then, Section 4 defines the extension of the graph pattern query notion, with the objective of enabling the specification of fuzzy preferences about quality in queries. Section 5 deals with the evaluation of a graph pattern query and the related cost. Section 6 presents a concrete implementation of the theoretical notions introduced in Sections 3 and 4: the Neo4j Cypher query language is extended according to the proposed framework, and relevant implementation issues are discussed. Section 7 recalls the main contributions and outlines a few research perspectives.

1. Literature Review

The framework proposed in this paper is situated at the intersection of quality management on the one hand and fuzzy querying of graph data on the other hand.

Quality modeling in non-relational databases. The first type of related work concerns quality modeling in the context of linked data, and more precisely RDF databases. Work in the field of linked data quality measurement is abundant and has led to a huge number of quality metrics [12, 13]. In the present work, it is assumed that some quality metrics are available and that these metrics follow the quality vocabulary defined by the W3C [14]. This vocabulary allows attaching quality measures and annotations to RDF data graphs. As far as quality measurement is concerned (which is the type of information targeted in this paper), the W3C vocabulary defines labels for edges connecting quality annotations to the nodes concerned. A part of this vocabulary is adapted in this work to fit the specificities of the attributed graph model, although the general principle is kept, as it still consists in connecting data nodes to quality ones. However, as shown in Section 3, attaching quality information to attributes in the attributed graph model requires a rather complex modeling effort and is not a straightforward task.

Another close contribution is [15]. In this work, the authors consider the attributed graph data model and propose a framework that makes it possible to attach quality annotations (but not quality metric values) at the node (or subgraph) level. The quality model proposed in the present work is more refined and expressive, as the annotations may also concern edges or attributes. Nor do the authors of [15] consider flexible queries exploiting quality information. Quality measurement within RDF data has been considered for various tasks, from data source integration [16] to data summarization [17], but never for fuzzy querying with the objective of letting users integrate quality-aware conditions into their queries.

Fuzzy queries to graph databases. Several frameworks have been proposed to allow users to define, through graphical interfaces, crisp conditions on the quality level of the data they are seeking [18, 19]. The approach detailed in this paper goes further, as it provides both an extension of the attributed graph format to handle quality measurements and a formal extension of graph query patterns.

In [20], the authors define a framework for expressing fuzzy queries to attributed graph databases, which was later extended with quantifier-based graph patterns [21]. They define a fuzzy extension of the Cypher language that makes it possible to express preferences on both the values of the nodes and the structure of the graph, but no quality information is considered (no “quality-aware” model is defined). In [22], fuzzy querying functionalities are added to the Cypher language, but again without any data quality dimension.

Whereas many approaches are devoted to the quantification of data quality and the construction of taxonomies of quality metrics, this paper studies the question of managing these “metadata” as close as possible to the data themselves. To reach this goal, a thorough formalization is first provided to integrate quality measurements into the attributed graph data model. Then, a formal extension of the query graph pattern is provided to allow users to express flexible queries, using the Cypher language, that integrate conditions on both the data and their quality “metadata”. To the best of our knowledge, this is the first framework addressing the issue of data quality management from data modeling to user querying in a flexible way.

2. Background Notions

2.1. Graph Databases

A graph database management system manages data whose structure is modeled as a graph (nodes are entities and edges are relations between entities). Such data are handled through graph-oriented operations and type constructors [5]. Among the existing systems, let us mention AllegroGraph [23], InfiniteGraph [24], Neo4j [11] and Sparksee [25]. Different models have been proposed for graph databases (see [5] for an overview), including the attributed graph (aka property graph), which aims to model a network of entities with embedded data. In this model, nodes and edges may contain data in attributes (aka properties).

[Figure: graph with person nodes (Ron Howard, James Thompson, Jessica Thompson, Bill Paxton, Audrey Tautou, Tom Hanks) and film nodes (Apollo 13, The De Vinci Code, Cloud Atlas, A League Of Their Own, That Thing We Do), connected by edges labeled follows, directed, reviewed and acted in. Nodes and edges carry attributes, such as {born: 1955} on Bill Paxton, {released: 1995} on Apollo 13 and {role: Sophie Neveu} on the acted in edge from Audrey Tautou to The De Vinci Code.]

Figure 1: A graph database G

In the following, we assume the existence of pairwise disjoint sets: two sets of nodes $\mathcal{V}$ (data nodes) and $\mathcal{V}_Q$ (data quality nodes), a set $\mathcal{E}$ of labels, and a set $\mathcal{A}$ of attribute names (the domain of an attribute $a \in \mathcal{A}$ is denoted by $Dom(a)$). Let us first recall the definition of a data graph.

Definition 1. A data graph $G$ is a pair $(V, R)$, where

• $V \subseteq \mathcal{V}$ is a finite set of nodes and

• $R = \{r_e \mid e \in \mathcal{E} \text{ and } r_e \subseteq V \times V\}$ is a set of sets of labeled edges between nodes.

The attributed graph notion extends the data graph one by making it possible to embed attributes (key-value pairs) in nodes and edges.

Definition 2. An attributed graph $G$ is a quadruple $(V, R, \zeta_V, \zeta_R)$ where

• $(V, R)$ is a data graph,

• $\zeta_V = \{\zeta_V^a \mid a \in \mathcal{A} \text{ and } \zeta_V^a : V \to Dom(a)\}$ is a set of partial functions that assign attribute values to nodes and

• $\zeta_R = \{\zeta_r^a \mid a \in \mathcal{A} \text{ and } r \in R \text{ and } \zeta_r^a : Dom(r) \to Dom(a)\}$ is a set of partial functions that assign attribute values to edges.

Let us illustrate the attribute assignment over the attributed graph of Figure 1. The following instantiation models the value 1995 attached to the attribute released of the node Apollo 13: $\zeta_V^{released}(\text{Apollo 13}) = 1995$. The following instantiation models the value Sophie Neveu attached to the attribute role of the edge that connects Audrey Tautou to The De Vinci Code through the acted in relation: $\zeta_{r_{acted\ in}}^{role}(\text{Audrey Tautou}, \text{The De Vinci Code}) = \text{Sophie Neveu}$.

In the following, an attributed graph is simply called a data graph.

Example 1. Figure 1 is a simple example of a graph database, denoted by $G$ in the following, which is used throughout the paper as a running example. This graph contains nodes that describe actors and films. Edges model relationships between nodes: for instance, the acted in relation connects actors to the films they played in. For the sake of readability, the nodes of the graph are identified by the name or title of the entity they model. In practice, this information can be embedded in attributes (this does not impact the framework proposed in the following). □

2.2. Graph Database Fuzzy Querying

A graph pattern query is classically defined as a graph where variables and conditions can occur. Let $V_n$ and $V_e$ be distinct sets of node variables and edge variables respectively. A graph pattern query is a tuple of the form $(V_{query}, R_{query}, \sigma_n, \sigma_e)$, where

• $V_{query} \subseteq \mathcal{V} \cup V_n$ is the set of nodes of the graph pattern,

• $R_{query} = \{r_e \mid e \in \mathcal{E} \cup V_e \text{ and } r_e \subseteq V_{query} \times V_{query}\}$ is the set of edges of the graph pattern,

• $\sigma_n$ (resp. $\sigma_e$) denotes a (possibly compound) Boolean condition over attribute values of the elements from $V_{query}$ (resp. $R_{query}$), for instance $a_1.nat = \text{'en'} \wedge a_2.born > 1880$, where $a_1$ and $a_2$ are author nodes ($a_1, a_2 \in V_{query}$).

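[Figure: pattern in which the node tom (where tom.name = Tom Hanks) and the node p2 are each connected by an acted in edge to the node m; the edge from p2 to m is bound to the variable r.]

Figure 2: Graph pattern $P_{Tom}$

Example 2. Figure 2 is a graph pattern query, denoted by $P_{Tom}$, that aims to retrieve the actors who played in a film with Tom Hanks. □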

The result $[[P]]_G$ of a pattern query $P$ over a graph $G$ is a set of matching subgraphs $\{g \in \mathcal{P}(G) \mid g \text{ “matches” } P\}$. A subgraph $g$ “matches” $P$ iff there exists a homomorphism $h$ from the nodes and labels of $P$ to $g$ such that each node $h(v)$ (resp. edge $h(l)$) satisfies its associated conditions in $\sigma_n(v)$ (resp. $\sigma_e(l)$).

Example 3. Figure 3 presents the result of $P_{Tom}$ (Figure 2) over the graph $G$ (Figure 1).

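[Figure: the two answer subgraphs. g1: Bill Paxton {born: 1955} and Tom Hanks {born: 1956, bornAt: Concord,US} are each connected by an acted in edge to Apollo 13 {released: 1995}; one of these edges carries {role: Jill Novel}. g2: Audrey Tautou and Tom Hanks {born: 1956, bornAt: Concord,US} are each connected by an acted in edge to The De Vinci Code {released: 2008}; the edge from Audrey Tautou carries {role: Sophie Neveu}.]

Figure 3: Answers of $P_{Tom}$ over $G$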
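To make the matching of Example 3 concrete, the following minimal Python sketch encodes a fragment of the running example as an attributed graph and enumerates the bindings of $P_{Tom}$. The dictionary-based encoding and the function name `match_p_tom` are illustrative assumptions, not part of the framework. The sketch also assumes that the two pattern edges must map to distinct data edges, which excludes the binding p2 = tom and reproduces the two answers of Figure 3; the homomorphism-based definition above does not state this restriction explicitly.

```python
# Hypothetical encoding of a fragment of the running example.
# Nodes are identified by their name or title, as in Figure 1.
NODES = {
    "Tom Hanks": {"born": 1956, "bornAt": "Concord,US"},
    "Bill Paxton": {"born": 1955},
    "Audrey Tautou": {},
    "Apollo 13": {"released": 1995},
    "The De Vinci Code": {"released": 2008},
}

# Each labeled edge (source, label, target) maps to its attributes.
EDGES = {
    ("Tom Hanks", "acted in", "Apollo 13"): {},
    ("Bill Paxton", "acted in", "Apollo 13"): {},
    ("Tom Hanks", "acted in", "The De Vinci Code"): {},
    ("Audrey Tautou", "acted in", "The De Vinci Code"): {"role": "Sophie Neveu"},
}


def match_p_tom(edges):
    """Enumerate the bindings (tom, p2, m) of the pattern P_Tom."""
    acted_in = [(s, t) for (s, label, t) in edges if label == "acted in"]
    answers = []
    for tom, m in acted_in:
        if tom != "Tom Hanks":  # node condition on the variable tom
            continue
        for p2, film in acted_in:
            # Assumption: the two pattern edges map to distinct data edges,
            # so the binding p2 = tom is excluded.
            if film == m and p2 != tom:
                answers.append({"tom": tom, "p2": p2, "m": m})
    return answers


answers = match_p_tom(EDGES)
for a in answers:
    print(a["p2"], NODES[a["p2"]], "played with Tom Hanks in", a["m"])
```

Each returned binding corresponds to one answer subgraph (g1 or g2), made of the three bound nodes and the two acted in edges between them.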

Some works introduced fuzzy preferences into the querying of graph-based data, mainly for RDF [26] and graph databases [27]. In these approaches, the semantics of flexible queries relies on fuzzy set theory, which was introduced by Zadeh [28] for modeling classes or sets whose boundaries are not clear-cut. For such objects, the transition between full membership and full mismatch is gradual rather than crisp. Typical examples of such fuzzy classes are those described by adjectives of natural language, such as young, cheap, recent, fast, etc. Formally, a fuzzy set $F$ on a referential $U$ is characterized by a membership function $\mu_F : U \to [0, 1]$ where $\mu_F(u)$ denotes the grade of membership of $u$ in $F$. In particular, $\mu_F(u) = 1$ reflects full membership of $u$ in $F$, while $\mu_F(u) = 0$ expresses absolute non-membership. When $0 < \mu_F(u) < 1$, one speaks of partial membership. Two crisp sets are of particular interest when defining a fuzzy set $F$: the core $C(F) = \{u \in U \mid \mu_F(u) = 1\}$, which gathers the prototypes of $F$, and the support $S(F) = \{u \in U \mid \mu_F(u) > 0\}$.

In practice, the membership function associated with $F$ is often of a trapezoidal shape, like that depicted in Figure 4. Then, $F$ is expressed by the quadruplet $(a, b, c, d)$ where $C(F) = [b, c]$ and $S(F) = ]a, d[$.

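[Figure: trapezoidal membership function; vertical axis $\mu_F$ from 0 to 1, horizontal axis $U$ with breakpoints a, b, c and d.]

Figure 4: Trapezoidal membership function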
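As an illustration of the quadruplet representation, the following Python sketch builds the membership function of a trapezoidal fuzzy set $(a, b, c, d)$. It assumes linear interpolation on the two slopes, and the function name `trapezoid` and the numeric breakpoints are hypothetical.

```python
def trapezoid(a, b, c, d):
    """Membership function of the fuzzy set (a, b, c, d).

    The core is [b, c] and the support is ]a, d[; the degree is assumed
    to vary linearly on ]a, b[ and ]c, d[.
    """
    def mu(u):
        if b <= u <= c:          # core: full membership
            return 1.0
        if u <= a or u >= d:     # outside the support
            return 0.0
        if u < b:                # rising slope on ]a, b[
            return (u - a) / (b - a)
        return (d - u) / (d - c) # falling slope on ]c, d[
    return mu


# Hypothetical breakpoints, chosen only for illustration.
mu = trapezoid(10, 20, 30, 50)
for u in (5, 15, 25, 40, 50):
    print(u, mu(u))
```

Testing the core first lets the same function handle "shoulder" shapes in which a = b or c = d, without any division by zero.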

The α-cut of a fuzzy set $F$, denoted by $F^\alpha$, is an ordinary set of elements whose satisfaction degree is at least equal to $\alpha$: $F^\alpha = \{u \in U \mid \mu_F(u) \geq \alpha\}$. Thus, $C(F)$ and $S(F)$ are two particular α-cuts of $F$ where $\alpha$ is respectively equal to $1$ and $0^+$.

Let $F$ and $G$ be two fuzzy sets on the universe $U$. We say that $F \subseteq G$ iff $\mu_F(u) \leq \mu_G(u), \forall u \in U$. The complement of $F$, denoted by $F^c$, is defined by $\mu_{F^c}(u) = 1 - \mu_F(u)$. Furthermore, $F \cap G$ (resp. $F \cup G$) is defined in the following way: $\mu_{F \cap G}(u) = \min(\mu_F(u), \mu_G(u))$ (resp. $\mu_{F \cup G}(u) = \max(\mu_F(u), \mu_G(u))$). As usual, the logical counterparts of the set-theoretic operators ∩, ∪ and complementation are respectively the conjunction ∧, the disjunction ∨ and the negation ¬. See [29] for more details.

Introducing fuzzy conditions into graph pattern queries [20, 30, 21] allows specifying preferences over different elements of the graph: its attribute values, its structure (paths connecting the nodes, shape of the searched subgraph pattern), or both. In the approach presented hereafter, fuzzy preferences go beyond those considered in these previous works, as they rely on an extended graph model and may also concern quality metadata. These aspects are developed in the next two sections.
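The fuzzy set notions recalled in this section (α-cut, core, support, inclusion, complement, intersection and union) can be sketched in Python over a finite universe. The sketch assumes that a fuzzy set is encoded as a dictionary from elements to membership degrees and that both operands are defined on the same universe; the function names and sample degrees are illustrative.

```python
def alpha_cut(fuzzy_set, alpha):
    """Elements whose membership degree is at least alpha."""
    return {u for u, degree in fuzzy_set.items() if degree >= alpha}


def core(fuzzy_set):
    """The 1-cut: elements with full membership."""
    return alpha_cut(fuzzy_set, 1.0)


def support(fuzzy_set):
    """The 0+ cut: elements with a strictly positive degree."""
    return {u for u, degree in fuzzy_set.items() if degree > 0}


def complement(f):
    return {u: 1 - degree for u, degree in f.items()}


def intersection(f, g):
    return {u: min(f[u], g[u]) for u in f}


def union(f, g):
    return {u: max(f[u], g[u]) for u in f}


def is_included(f, g):
    """F is included in G iff mu_F(u) <= mu_G(u) for every u."""
    return all(f[u] <= g[u] for u in f)


# Hypothetical fuzzy sets on the universe {x, y, z}.
F = {"x": 1.0, "y": 0.5, "z": 0.0}
G = {"x": 1.0, "y": 0.75, "z": 0.25}
print(core(F), support(F), is_included(F, G))
```

Since conjunction and disjunction are interpreted by min and max, the same two functions are reused later when fuzzy preferences are combined with connectives.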

3. A Quality-Aware Graph Data Model

Data quality is a complex concept, which embraces different semantics depending on the context [31]. It is described through a set of quality dimensions that categorize criteria of interest. Classical quality dimensions are completeness (the degree to which needed information is present in the collection), accuracy (the degree to which data are correct), consistency (the degree to which data respect integrity constraints and business rules) and freshness (the degree to which data are up-to-date). Data quality over a dimension is measured according to a set of metrics that allow a quantitative definition and evaluation of the dimension. Examples of metrics are “the number of missing metadata” for the evaluation of completeness, “the number of misspelled film titles” for accuracy, and “the number of actors who are said to play in a film after their date of death” for consistency. These are simple examples, but the literature proposes a large range of dimensions and metrics, conceptualised in quality models [3]. Metrics may be defined and measured at different levels of the data. In the case of graph data, a metric can concern a node (or one of its attributes), e.g., the freshness degree of the information embedded in the node may be defined according to the date of the last update, or the set of nodes of the graph, e.g., the number of nodes that can be considered as fresh in the database.

Let us now turn to the problem of modeling quality information, which implies associating quality metrics with elements of the graph data.

Remark 1. The scope of the framework does not include the problem of assessing data quality, which is another very large topic [3]. It is assumed that the quality measures are available. The issue studied here is their modeling and their exploitation when querying the data.

3.1. Topology of Data Quality Measures

We first discuss the definition of the vocabulary devoted to the description of quality information (called “quality vocabulary” in the following). Let us assume the existence of two disjoint sets that denote elements of the quality vocabulary: a set of quality metrics $Met$ and a set of quality dimensions $Dim$. Let us also assume the existence of two other disjoint sets of quality nodes $V_{Met}$ and $V_{Dim}$.

Definition 3. A quality vocabulary $Voc = (V_{voc}, R_{voc})$ is a tree such that:

• $V_{voc} \subseteq V_{Met} \cup V_{Dim} \cup \{Quality\}$ is a set of quality nodes (each element of this set may denote a dimension or a metric),

• $R_{voc} \subseteq V_{voc} \times V_{voc}$ is a set of edges expressing the fact that a metric contributes to the definition of a quality dimension,

• and the node Quality is the root of the tree.

Every internal node $n_{int}$ is a dimension ($n_{int} \in V_{Dim}$). Each leaf $n_{leaf}$ is a quality metric ($n_{leaf} \in V_{Met}$) and has a parent quality dimension ($\exists d \in V_{voc} \mid d \in V_{Dim} \cup \{Quality\}$ and $(d, n_{leaf}) \in R_{voc}$). The set of metrics of a quality vocabulary $Voc$ is denoted by $Metrics(Voc)$.

This modeling is a classical view of quality concepts as a hierarchy of quality metrics classified into quality dimensions. Let us note that a given dimension may sometimes be decomposed into several subdimensions. For instance, the accuracy dimension can be refined into syntactic accuracy and semantic accuracy.

Example 4. We assume the existence of some quality metrics named $m_{accTitle}$, $m_{accRel}$, $m_{accName}$, $m_{compFilm}$, $m_{compRole}$, $m_{compActor}$ and $m_{lastUpdate}$, which will be detailed later. Figure 5 is a simple example of a quality vocabulary containing some dimensions (internal nodes), which are Accuracy, Completeness and Freshness, and then Syntactic accuracy and Semantic accuracy attached to the Accuracy dimension. Quality metrics (leaves) are attached to these dimensions according to the kind of problem they are intended to address. □

[Figure: tree rooted at Quality, with the dimensions Accuracy (refined into Syntactic accuracy and Semantic accuracy), Completeness and Freshness as internal nodes and the metrics $m_{accName}$, $m_{accTitle}$, $m_{accRel}$, $m_{compFilm}$, $m_{compRole}$, $m_{compActor}$ and $m_{lastUpdate}$ as leaves.]

Figure 5: A quality vocabulary

3.2. Attaching Quality Measurements to Graph Data

Now that the quality vocabulary is defined, the question of modeling quality information in the data graph is still open. The first contribution of this work is to provide a formalism making it possible to represent very detailed quality information, including information about the quality of attribute values. The modeling of quality information should not degrade the understandability of the data themselves, so there must be a clear separation between the data and the quality metadata.

Roughly speaking, quality information is associated with nodes and attributes of the data graph. For readability reasons, quality information is not modeled inside the data graph itself (using additional attribute/value pairs, for instance) but in another graph that “mirrors” (part of) the data graph¹.

Let us now assume the existence of a set of quality nodes $\mathcal{V}_Q$ that is disjoint from $\mathcal{V}$. Let $G = (V, R, \zeta_V, \zeta_R)$ be an attributed graph, and $Voc$ be a quality vocabulary. The extension of the data graph $G$ with quality information conceptually consists of two steps. In the first step, the idea is to add quality information to nodes and to their attributes.

¹ Let us note that an attributed graph does not include the notion of namespace that classifies elements of the graph.

Definition 4. A data graph $G_{ext}$ extended with quality information is a tuple $(V', R', \zeta_V, \zeta_R)$ such that:

• $V' = V \cup V_Q$ where $V_Q \subseteq \mathcal{V}_Q$. Intuitively, the graph now contains new nodes, which store quality information (see next item).

• $\zeta_Q = \{\zeta_Q^m \mid m \in Metrics(Voc) \text{ and } \zeta_Q^m : V_Q \to [0, 1]\}$ are measurements of quality metrics associated with quality nodes, which can be seen as key-value pairs where keys are quality metrics and values are their measurements.

• $R' = R \cup R_Q$. Intuitively, edges are enriched by the set of relations $R_Q$, which models the connections needed for attaching quality nodes to data nodes. $R_Q = \{r^Q_{hasMeasure}\} \cup \{r_a^Q \mid a \in \mathcal{A}\}$ contains two kinds of relations:

  – The first relation $r^Q_{hasMeasure} \subseteq V \times V_Q$ connects each node $n$ to its quality node (if it exists, i.e., if quality information regarding $n$ is available). A pair $(n, n^Q)$ attaches the quality node $n^Q$ to the data node $n$. The quality node embeds, in its attributes, quality information concerning $n$ globally.

  – The second kind of relation, $\{r_a^Q \mid a \in \mathcal{A}\}$, attaches quality information to attributes of data nodes. Each relation $r_a^Q \subseteq V_Q \times V_Q$ attaches quality information (if it is available) to the quality nodes associated with data nodes that have $a$ as an attribute. More precisely, for every data node $n$ and every attribute $a$ of $n$ for which quality information is available, there are a node $n^Q_a$ and an edge labeled $a$ from $n^Q$ (the quality node associated with $n$) to $n^Q_a$.

The set of quality nodes connected to the node $n$ (which is the data quality subgraph that concerns $n$) is denoted by $Qual(n)$. □

Example 5. Let us consider the node from the data graph $G$ that refers to the film named The De Vinci Code, released in 2008. There are quality problems here. First, the title of the film is inaccurate (it should be The Da Vinci Code). Second, the release value is wrong (it should be 2006). The inaccuracy may be measured by several metrics, processed by users or automated tools. Let us assume that a normalised metric $m_{accTitle}$ was used to assess the accuracy of the title, leading to a value of 0.65. In the same spirit, let us assume that the accuracy of the release date is measured by $m_{accRel}$ and yields 0.7. Let us also suppose that the completeness degree of the node is measured by $m_{compFilm}$ and is equal to 0.3 (which can be explained by the fact that a lot of information is missing for this film: music, production, rest of the cast, etc.). The quality information extending the data nodes, attached to the The De Vinci Code node, is represented with dashed elements in Figure 6.

[Figure: the node The De Vinci Code {released: 2008} is connected by a hasMeasure edge to a dashed quality node {maccTitle: 0.65, mcompFilm: 0.3}, which is itself connected by an edge labeled released to a dashed quality node {maccRel: 0.7}.]

Figure 6: Attaching quality to a node (The De Vinci Code)

Here, if $g$ is the subgraph made of the node The De Vinci Code, then $Qual(g)$ is composed of the two dashed (quality) nodes.

Another example concerns the node that refers to the actor named Audrey Tautou. Let us assume that a normalised metric $m_{compActor}$ measured the completeness of this node, for which the birthdate is missing, resulting in the value 0.5 (such a computation could be made, for instance, by a collaborative process). Let us also assume that another metric $m_{accName}$ checked the syntactic accuracy of the actor’s name (for instance, checked the existence of the name in a reference database), leading to the value 1 (meaning that no problem was detected). Then the quality information attached to the Audrey Tautou node would be that of Figure 7.

[Figure: the node Audrey Tautou is connected by a hasMeasure edge to a quality node {mcompActor: 0.5, maccName: 1}.]

Figure 7: Attaching quality to a node (Audrey Tautou)

In this formalization, every quality metric belongs to the quality vocabulary defined in Figure 5. □

Let us now explain how quality information can be attached to edges and their attributes. Due to some limitations of the attributed graph model, this part requires a slightly more complex modeling.

Example 6. Figure 8 represents the relation acted in that connects Bill Paxton to the film Apollo 13 (the two nodes respectively play the roles of $n_1$ and $n_2$ in Definition 4 (cont.) below). The name of the role is missing, leading to a completeness value of $m_{compRole} = 0.3$, where $m_{compRole}$ is classified in (i.e., is a child of) the Completeness dimension in the quality vocabulary. (Of course, other quality metrics can be attached if needed.)

[Figure: on the left, the data nodes Bill Paxton {born: 1955} and Apollo 13 {released: 1995}, connected by an acted in edge; each data node is connected by a hasMeasure edge to its quality node on the right. The two quality nodes are connected, through two edges labeled acted in, via an intermediate quality node {mcompRole: 0.3}.]

Figure 8: Attaching quality to an edge

Here, $Qual(g)$ (where $g$ is the subgraph made of the two data nodes and the edge connecting them) is made of the three quality nodes represented in the right part of Figure 8. □

Definition 4 (cont.). In order to attach quality information to edges and their attributes, $R_Q$ is enriched with another set of relations $\{r_e^Q \mid e \in \mathcal{E} \text{ and } r_e^Q \subseteq V_Q \times V_Q\}$. These quality relations connect quality nodes whose associated data nodes are connected through data edges. This makes it possible to qualify the quality of data edges and their attributes. For every edge $(n_1, n_2) \in r_e$ in the data graph, whose end nodes are respectively associated with the quality nodes $n_1^Q$ and $n_2^Q$ (i.e., $(n_1, n_1^Q), (n_2, n_2^Q) \in r^Q_{hasMeasure}$), one creates (if needed, i.e., if some quality information about $r_e(n_1, n_2)$ is available) two edges in the quality graph: $(n_1^Q, n_r)$ and $(n_r, n_2^Q)$, both belonging to $r_e^Q$. Then, the node $n_r$ embeds the quality information available about the relation $r_e(n_1, n_2)$. The set of quality nodes connected to the edge $e$ (which is the data quality subgraph that concerns $e$) is denoted by $Qual(e)$. □

Intuitively, the quality graph is a kind of “mirror” of the data graph. By extension, the set of quality nodes connected to a subgraph $g$, denoted by $Qual(g)$, follows directly from Definition 4.

Notice that, if needed, a quality node could be attached to each attribute of an edge, in the same way as for nodes. For the sake of simplicity of the presentation, we omit this aspect here.
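The quality graph of Definition 4 can be sketched in Python with plain dictionaries and sets, using the measurements of Examples 5 and 6. The quality node identifiers (such as `q_davinci`) and the helper names are hypothetical. The sketch also reads $Qual(e)$ as the intermediate quality node $n_r$ of the edge, which is one possible interpretation of the definition.

```python
# zeta_Q: measurements stored in quality nodes (identifiers are hypothetical).
QUALITY_NODES = {
    "q_davinci": {"maccTitle": 0.65, "mcompFilm": 0.3},
    "q_davinci_released": {"maccRel": 0.7},
    "q_paxton": {},
    "q_apollo": {},
    "q_paxton_apollo": {"mcompRole": 0.3},
}

# hasMeasure: data node -> its quality node.
HAS_MEASURE = {
    "The De Vinci Code": "q_davinci",
    "Bill Paxton": "q_paxton",
    "Apollo 13": "q_apollo",
}

# Attribute relations: (quality node, attribute name) -> attribute quality node.
ATTRIBUTE_QUALITY = {("q_davinci", "released"): "q_davinci_released"}

# Edge relations: two quality edges, labeled like the data edge they mirror.
EDGE_QUALITY = {
    ("q_paxton", "acted in", "q_paxton_apollo"),
    ("q_paxton_apollo", "acted in", "q_apollo"),
}


def qual_node(n):
    """Quality nodes attached to the data node n and to its attributes."""
    q = HAS_MEASURE.get(n)
    if q is None:
        return set()
    return {q} | {t for (s, _), t in ATTRIBUTE_QUALITY.items() if s == q}


def qual_edge(n1, label, n2):
    """Intermediate quality node(s) mirroring the data edge (n1, label, n2)."""
    q1, q2 = HAS_MEASURE.get(n1), HAS_MEASURE.get(n2)
    return {
        mid
        for (s, l, mid) in EDGE_QUALITY
        if s == q1 and l == label and (mid, label, q2) in EDGE_QUALITY
    }


def qual_subgraph(nodes, edges):
    """Qual(g) for a subgraph g given by its data nodes and data edges."""
    result = set()
    for n in nodes:
        result |= qual_node(n)
    for e in edges:
        result |= qual_edge(*e)
    return result


for q in sorted(qual_node("The De Vinci Code")):
    print(q, QUALITY_NODES[q])
```

Keeping the quality nodes in structures separate from the data graph mirrors the separation between data and quality metadata that the model requires.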

4. Quality-Aware Fuzzy Queries

The modeling of quality information is rather complex and cannot be easily grasped by an end user. The query language introduced hereafter is a way to hide this complexity from the user. It makes it possible to query the data with preferences related to quality information, in a user-friendly and intuitive way.

Definition 5 (Quality-aware preference query). A quality-aware preference query is a triple $(P, P_{pref}, L_{aware})$ where:

• $P$ is a graph pattern that defines the shape of the data to be retrieved (see Section 2.2),

• $P_{pref}$ is a recursively defined preference over quality data, which takes one of the following forms:

  – $t_{QT}(g)$ is $T$, where $t_{QT} \in Voc$, $g$ is a subgraph pattern of $P$ (in its simplest form, it can be one of the variables of $P$) and $T$ is a fuzzy term, or

  – $p_1 \otimes p_2$ where $p_1$ and $p_2$ are fuzzy preferences, and $\otimes$ is a fuzzy connective,

• $L_{aware}$ is a list of elements of the form $t_{QT}(g')$ where $g'$ denotes a subgraph of $P$ (one of the variables of $P$ in its simplest form).

The first element is the search pattern. The second element of the quality-aware query makes it possible to associate a quality score with each answer, which corresponds to the fuzzy satisfaction degree obtained by the answer according to the preference condition. The third and last element makes it possible to retrieve a set of quality measures that can be different from those involved in the preference condition. Note that, for the last two items, the quality elements are either metrics or dimensions of the quality vocabulary.

Remark 2. A classical pattern query is a specific case of a quality-aware preference query for which the last two elements of the triple are $\top$ and $\emptyset$ respectively.

Example 7. Let us consider the following pattern query $Q_{Tom}$:

$(P_{Tom}, \; Accuracy(m) \text{ IS } high, \; \langle Completeness(p2), Completeness(r) \rangle)$

Such a query aims to retrieve the subgraphs that match $P_{Tom}$ (see Figure 2), viz. the actors (variable p2 in $P_{Tom}$) who played in a film (variable m) with Tom Hanks, for which the accuracy of the film is high. Together with each answer, the completeness quality score attached to the second actor and the completeness quality score attached to the relation that connects the second actor to the film are required.

Let us now define the semantics of a quality-aware preference query. In order to do so, one must first define the notion of a satisfaction degree related to a quality preference.

Definition 6 (Quality satisfaction degree). Let us consider a vocabulary $Voc = (V_{voc}, R_{voc})$ and an extended data graph $(V \cup V_Q, R \cup R_Q)$. Let $Agg$ be an aggregation function (for instance a min or an average in its simplest form). The satisfaction degree of a data subgraph $g \subseteq (V, R)$ relative to a quality element $elt_q \in V_{voc}$ (which, by definition, denotes a quality metric, a quality dimension or the quality root), denoted by $qscore(g, elt_q)$, is recursively defined by bottom-up computation in the quality vocabulary tree:

• if $elt_q \in Metrics(Voc)$ then $qscore(g, elt_q) = Agg_{n \in Qual(g)}\ \zeta_Q^{elt_q}(n)$. This means that the quality score is the aggregation of the metric quality scores associated with the elements of $g$ (the default value being 1);

• else ($elt_q$ is a dimension or the quality root), the score related to $elt_q$ results from an aggregation of the scores related to its children nodes in the quality vocabulary tree: $qscore(g, elt_q) = Agg_{ch \in Ch}\ qscore(g, ch)$ where $Ch = \{ch \mid (elt_q, ch) \in R_{voc}\}$.
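The bottom-up computation of Definition 6 can be illustrated by a minimal Python sketch. The dictionary-based structures, the root name "Quality" and the vocabulary fragment are assumptions made for this illustration only; the metric name mcompRole and its value 0.3 come from the running example. An element without children is treated as a metric, and the default value 1 is used when no measure is found.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory structures, introduced only for this illustration.
Vocabulary = Dict[str, List[str]]        # quality element -> its children
Measures = Dict[str, Dict[str, float]]   # data element -> {metric: score}


def qscore(g: List[str], elt: str, voc: Vocabulary, measures: Measures,
           agg: Callable[[List[float]], float] = min) -> float:
    """Quality score of the data subgraph g for the quality element elt."""
    children = voc.get(elt, [])
    if not children:
        # elt is a metric: aggregate the scores attached to the elements of g.
        found = [measures[n][elt] for n in g if elt in measures.get(n, {})]
        return agg(found) if found else 1.0   # default value 1
    # elt is a dimension or the root: aggregate the scores of its children.
    return agg([qscore(g, ch, voc, measures, agg) for ch in children])


# Assumed vocabulary fragment (the root name "Quality" is hypothetical).
voc = {
    "Quality": ["Accuracy", "Completeness"],
    "Accuracy": ["maccTitle", "maccRel"],
    "Completeness": ["mcompRole"],
}
# Only the measure stated in the running example for the acted_in edge.
edge = "acted_in(Bill Paxton, Apollo 13)"
measures = {edge: {"mcompRole": 0.3}}

print(qscore([edge], "Completeness", voc, measures))  # 0.3
print(qscore([edge], "Accuracy", voc, measures))      # 1.0
```

With min as the aggregation function, a single low metric score propagates up to every ancestor in the vocabulary tree, which is why the root score of this edge is also 0.3.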

Definition 7 (Quality-aware query semantics). Given a quality vocabulary $Voc = (V_{voc}, R_{voc})$, an extended data graph $G = (V \cup V_Q, R \cup R_Q)$ and a quality-aware query $Q = (P, P_{pref}, \langle l_1(g_1), ..., l_k(g_k) \rangle)$, the interpretation $[[Q]]_G$ of $Q$ over $G$ is the set of answers having the form of triples $(g, \mu_{pref}, L_\mu)$ such that

• $g \in [[P]]_{(V,R)}$, meaning that $g$ is an answer relative to $(V, R)$ (here, quality information does not play any role; see Section 2.2),

• $\mu_{pref}$ is the quality satisfaction degree of $g$, recursively defined according to the form of $P_{pref}$ by

– $\mu_{pref} = \mu_{FT}(qscore(g, t_{QT}))$, if $P_{pref}$ is of the form $t_{QT}(g)$ is $FT$, where $t_{QT} \in Voc$, $g$ is a subgraph pattern of $P$, $FT$ is a fuzzy term, and $\mu_{FT}$ denotes $FT$'s membership function (see Section 2.1);

– $\mu_{pref} = \psi_\otimes(\mu_{p_1}, \mu_{p_2})$, if $P_{pref}$ is of the form $p_1 \otimes p_2$ where $p_1$ and $p_2$ are fuzzy preferences, $\otimes$ is a fuzzy connective, and $\psi_\otimes$ is the interpretation of $\otimes$ (for instance, $\psi_{and} = \min$).

• $L_\mu = \langle sc_1, ..., sc_k \rangle$ where $sc_i = qscore(g_i, l_i)$ for each $i \in [1..k]$.

For a candidate answer $g$, getting $\mu_{pref} = 0$ is equivalent to not being in the answer set.

Remark 3. If the pattern $P$ is itself fuzzy (see [21]), then each answer also gets a degree that expresses the extent to which it satisfies the pattern. Then, one could either combine the pattern satisfaction degree with the quality satisfaction degree, or return a pair of degrees attached to each answer. We believe that these two degrees should not be merged as their semantics are clearly different (one degree concerns the pattern query satisfaction in terms of data, and the other degree concerns the quality assessment in terms of metadata).

Let us also note that the quality preferences are defined in terms of a simple and unique taxonomy that classifies the quality terms of the vocabulary, but the taxonomy could be made more expressive by adding user-dependent weights reflecting the extent to which a child participates in the definition of its parent, as proposed in [15].

Example 8. Let us consider the fuzzy term high that models the fact that between 0 and 0.6, a given value (expressing, for instance, a percentage) is definitely not high, and that beyond 0.8, it is definitely high. Between 0.6 and 0.8, the satisfaction of the predicate is gradual. The membership function $\mu_{high}$ is represented in Figure 9.

Figure 9: Representation of the fuzzy term high (membership function $\mu_{high}$: 0 up to $\delta = 0.6$, increasing linearly to 1 at $\gamma = 0.8$, and 1 beyond)
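As an illustration, the membership function of an increasing fuzzy term such as high can be written in a few lines of Python. This is only a sketch: the helper name ascending_term is hypothetical, and the linear shape between $\delta$ and $\gamma$ is the one suggested by Figure 9.

```python
from typing import Callable


def ascending_term(delta: float, gamma: float) -> Callable[[float], float]:
    """Membership function of an increasing fuzzy term (delta < gamma assumed).

    The degree is 0 up to delta, 1 from gamma on, and grows linearly in between.
    """
    def mu(x: float) -> float:
        if x <= delta:
            return 0.0
        if x >= gamma:
            return 1.0
        return (x - delta) / (gamma - delta)
    return mu


mu_high = ascending_term(0.6, 0.8)
print(round(mu_high(0.65), 2))  # 0.25
```

The value 0.65 is the accuracy score used for the film of the second answer in Example 8, which yields the satisfaction degree 0.25.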

Let us now return to the interpretation of the quality-aware query $Q_{Tom}$, previously defined in Example 7 and recalled hereunder.

$Q_{Tom} = (P_{Tom},\ Accuracy(m)\ \text{IS}\ high,\ \langle Completeness(p2),\ Completeness(r) \rangle)$

For $Q_{Tom}$, each answer has the form of a triple (subgraph answer, score of Accuracy(m) IS high, ⟨score of Completeness(p2), score of Completeness(r)⟩). The second element of each answer corresponds to the value of $\mu_{high}(qscore(m, Accuracy))$.

The answer of $Q_{Tom}$ over the data graph $G$ of the running example (Figure 1 with added quality information) is the following one (taking min as the aggregation operator $Agg$ introduced in Definition 6).

$[[Q_{Tom}]]_G = \{(g_1, 1, \langle 1, 0.3 \rangle),\ (g_2, 0.25, \langle 0.5, 1 \rangle)\}$

where $g_1$ and $g_2$ are the subgraphs from Figure 3 (i.e., the data subgraphs that match $P_{Tom}$).

For $g_1$ (where the variable m of the query pattern is matched to the film Apollo 13, and the variable p2 to Bill Paxton), no accuracy problem is detected for the retrieved film, so the quality preference Accuracy(m) IS high has a satisfaction degree of 1. The quality score of Completeness(p2) is also 1, as no completeness problem is detected for the Bill Paxton node. However, the relation acted_in (matching the variable r of the query pattern) that connects the Bill Paxton node to the Apollo 13 one is incomplete (see Example 6), its completeness score being 0.3 (the value of the mcompRole metric, which is classified in the Completeness dimension in the quality vocabulary and is the only completeness metric associated with this edge).

For $g_2$ (where the variable m of the query pattern is matched to the film The De Vinci Code, and the variable p2 to Audrey Tautou), one has $\mu_{high}(0.65) = 0.25$ where 0.65 is the minimum (the $Agg$ aggregation function chosen here) of the values of the accuracy metrics associated with m, which are the maccTitle and maccRel metrics associated with the film The De Vinci Code for $g_2$ (see Example 5). The quality score of Completeness(p2) is 0.5, which is the value of the only completeness metric attached to the Audrey Tautou node (Example 5). In $g_2$, the relation acted_in (matching the variable r of the query pattern) that connects the Audrey Tautou node to the The De Vinci Code one is complete (so its completeness score is 1).
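The semantics of Definition 7 can be replayed on this example with a short Python sketch. It is an illustration under assumptions, not the authors' implementation: the classes Atom and Conn and the function names are hypothetical, the qscore values are taken as given (they are the ones stated above for $g_1$ and $g_2$), and only the connective "and", interpreted by min, is provided.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Key = Tuple[str, str]                 # (quality term, pattern variable)


@dataclass
class Atom:                           # preference of the form t_QT(g) IS FT
    term: str
    var: str
    mu: Callable[[float], float]      # membership function of FT


@dataclass
class Conn:                           # preference of the form p1 (x) p2
    op: str
    left: "Pref"
    right: "Pref"


Pref = Union[Atom, Conn]
PSI = {"and": min}                    # psi_and = min, as in Definition 7


def mu_pref(p: Pref, scores: Dict[Key, float]) -> float:
    """Recursive satisfaction degree of a preference for one answer."""
    if isinstance(p, Atom):
        return p.mu(scores[(p.term, p.var)])
    return PSI[p.op](mu_pref(p.left, scores), mu_pref(p.right, scores))


def interpret(candidates: Dict[str, Dict[Key, float]], pref: Pref,
              aware: List[Key]) -> List[Tuple[str, float, List[float]]]:
    """Build the triples (g, mu_pref, L_mu); a degree of 0 excludes g."""
    answers = []
    for g, scores in candidates.items():
        degree = mu_pref(pref, scores)
        if degree > 0:
            answers.append((g, degree, [scores[k] for k in aware]))
    return answers


def high(x: float) -> float:          # fuzzy term high of Example 8
    return min(1.0, max(0.0, (x - 0.6) / (0.8 - 0.6)))


# qscore values stated in Example 8 (min aggregation).
candidates = {
    "g1": {("Accuracy", "m"): 1.0, ("Completeness", "p2"): 1.0,
           ("Completeness", "r"): 0.3},
    "g2": {("Accuracy", "m"): 0.65, ("Completeness", "p2"): 0.5,
           ("Completeness", "r"): 1.0},
}
q_tom_pref = Atom("Accuracy", "m", high)
aware = [("Completeness", "p2"), ("Completeness", "r")]
for g, degree, l_mu in interpret(candidates, q_tom_pref, aware):
    print(g, round(degree, 2), l_mu)
```

Replacing the preference by a conjunction such as Conn("and", q_tom_pref, Atom("Completeness", "p2", high)) would discard $g_2$, since high(0.5) is 0.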

5. Algorithm and Cost

The cost of a graph pattern query evaluation depends on the form of the pattern. The data complexity over arbitrary graph pattern queries (conjunctive regular path queries including regular expressions with variables on edges) is NP-complete [32]. Tractable fragments are based on syntactical restrictions of the pattern that either take the form of a regular path query (the pattern is only a path connecting two nodes, defined by a regular expression) or more generally a conjunctive regular path query (a graph pattern where each edge is a regular path expression) possibly including specific navigational capabilities such as inverse traversal, nested regular expressions, or memory registers (see e.g. [32, 33]). For instance, in the case of Conjunctive Regular Path Queries (CRPQ) $Q$, meaning patterns with variables allowed on nodes, and edges labelled by expressions of a regular language defined over $E$, the cost of the evaluation is of the order $|G|^{O(|Q|)}$ [33, 34]. If the considered pattern is a 2RPQ (regular path query with reverse traversal) $L$, then the evaluation is in linear time $O(|G| \cdot |L|)$ [32].

In order to exhibit the additional cost incurred by quality evaluation and to provide an algorithm that guides the implementation of the approach, we propose an extension of the generic graph pattern evaluation algorithm proposed in [35], which constitutes a common framework for other state-of-the-art algorithms including GraphQL, QuickSI, SPath and SwiftIndex (see [35] for the comprehensive list of studied algorithms). This algorithm, originally designed for comparing some state-of-the-art algorithms, is composed of a generic skeleton that invokes subroutines whose implementation is specific to each algorithm. Without giving all the details of the generic algorithm, the intuition of its flow logic is that, given a pattern query $Q$ and a data graph $G$, the algorithm calculates $[[Q]]_G$ by a trial-and-error procedure that recursively constructs each answer mapping, one edge at a time.
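The trial-and-error intuition can be made concrete with a naive backtracking sketch in Python. It is not the generic algorithm of [35]: labels, conditions and every optimization are left out, the mapping is assumed to be injective (subgraph isomorphism), and the function name and graph encoding are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Edge = Tuple[str, str]   # directed edge (source, target)


def subgraph_search(q_edges: List[Edge], g_edges: Set[Edge],
                    mapping: Optional[Dict[str, str]] = None,
                    i: int = 0) -> Iterator[Dict[str, str]]:
    """Yield every injective mapping of query vertices to data vertices
    that sends each query edge onto a data edge."""
    mapping = {} if mapping is None else mapping
    if i == len(q_edges):            # every query edge is matched: M is complete
        yield dict(mapping)
        return
    u1, u2 = q_edges[i]
    for n1, n2 in g_edges:           # trial: map the i-th query edge to (n1, n2)
        trial = dict(mapping)
        ok = True
        for u, n in ((u1, n1), (u2, n2)):
            if u in trial:
                ok = ok and trial[u] == n   # must agree with earlier choices
            elif n in trial.values():
                ok = False                  # injectivity: n is already used
            else:
                trial[u] = n
        if ok:                       # otherwise: error, try the next data edge
            yield from subgraph_search(q_edges, g_edges, trial, i + 1)


# Pattern of the running example without its WHERE condition.
query = [("tom", "m"), ("p2", "m")]
data = {
    ("Tom Hanks", "Apollo 13"), ("Bill Paxton", "Apollo 13"),
    ("Tom Hanks", "The De Vinci Code"), ("Audrey Tautou", "The De Vinci Code"),
}
answers = list(subgraph_search(query, data))
print(len(answers))  # 4
```

Four mappings are found because the condition on the name of tom is ignored in this sketch: each film contributes two mappings, one per ordering of its two actors.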

The extension for handling quality-aware queries, based on the approach proposed in Sections 3 and 4, aims at calculating a quality score associated with each answer. This impacts the algorithm of [35] in several ways. The first impact concerns the data structure used for modeling an answer. The second impact concerns the algorithm itself.

Data structure. The result of a quality-aware query associates two kinds of quality information with each answer mapping, meaning that the data structure modeling an answer now has to be a triple, as specified in Definition 7, composed of (1) a mapping denoted by $M$ (which links a subgraph answer to the query pattern), (2) a quality score denoted by $\mu_{pref}$ and (3) the scores of the quality elements of interest, denoted by $L_\mu$.

Algorithm extension. Let us now consider the query answering process provided by the generic algorithm. Its core subroutine, SubgraphSearch, recursively searches for the answers of the given pattern. Its initial version begins with the conditional instruction given in Algorithm 1.

Algorithm 1: SubgraphSearch initial routine of the generic algorithm
1  if |M| = |V(q)|   // If M is complete and satisfactory according to query q
2  then
3      report M;
4  .. Rest of the initial SubgraphSearch subroutine ..

In lines 1 to 3, the SubgraphSearch subroutine checks whether a mapping $M$, which is recursively under construction, maps all the vertices of the query (the set of vertices of a query $q$ is denoted by $V(q)$). If so, $M$ is an answer and is then returned in the set of answers to $q$. The rest of the routine (starting from line 4) consists of the recursive construction of the other answers (it is complex of course, see [35] for details, but remains unchanged in our extension).

In order to provide quality information associated with each answer, the SubgraphSearch subroutine has to be extended. It has to compute quality scores for each complete answer that is found. Algorithm 2 extends Algorithm 1 in order to introduce a quality score calculation associated with each answer. In the following, the quality vocabulary (see Definition 3) is denoted by $Voc = (V_{voc}, R_{voc})$. The extension of the subroutine introduces a set of new variables $\{Qt(g_1), \cdots, Qt(g_n)\}$, where $g_{qt} = \{g_1, \cdots, g_n\}$ are the quality subgraphs that appear in the $P_{pref}$ and $L_{aware}$ quality-aware specifications of the query (see Definition 5). Each $Qt(g_i)$ is a quality vector of size $|Metrics(Voc)|$ that associates, for the subgraph $g_i$ of the answer $M$, a quality score with each quality metric defined in the quality vocabulary.

The extension instantiates the quality vectors according to the quality measures attached to the data subgraph denoted by the mapping, searching for measures attached to nodes (lines 6 to 8), attributes (lines 9 to 12), and then edges (lines 13 to 16). The final quality scores $\mu_{pref}$ and $L_\mu$ are calculated by the routines computePref and filter, in lines 17 and 18 respectively. Each of these routines implements a recursive bottom-up traversal of the quality profile for calculating quality scores, and possibly the final application of a membership function (see Definition 7). These instructions clearly have a negligible cost. Finally, the complete answer is reported in line 19.

The additional cost introduced by the quality-awareness feature results from the computation of the score associated with the internal elements of qtInterest for each subgraph answer; this cost is quantified after Algorithm 2.

Algorithm 2: SubgraphSearch additional routine
1  if |M| = |V(q)|   // If M is complete and satisfactory according to query q
2  then
       // 1- Initialization of the quality vectors
3      foreach (g_i, m) ∈ g_qt × Metrics(Voc) do Qt(g_i)[m] ← 1;
       // 2- Instantiation of the quality vectors
       // 2.1- w.r.t. the nodes (see Figure 6)
4      foreach node n s.t. (u, n) ∈ M   // viz. the node u of the query q is mapped to the node n of G
5      do
6          Retrieve n^Q s.t. hasMeasure(n, n^Q);
           // then one integrates the metric values attached to the quality node:
7          foreach attribute a of n^Q and foreach g_i s.t. u ∈ g_i do
8              Qt(g_i)[a] ← Agg(Qt(g_i)[a], n^Q.a);   // where n^Q.a is the value of a for n^Q
           // then one integrates the metric values attached to the attributes of n:
9          foreach attribute a of n do
10             Retrieve n^Q_a s.t. a(n^Q, n^Q_a);
11             foreach attribute a of n^Q_a and foreach g_i s.t. a(n^Q, n^Q_a) ∈ g_i do
12                 Qt(g_i)[a] ← Agg(Qt(g_i)[a], n^Q.a);   // where n^Q.a is the value of a for n^Q
       // 2.2- w.r.t. the edges (see Figure 8)
13     foreach edge {(u_1, n_1), (u_2, n_2)} ⊆ M s.t. edge(u_1, u_2) in q do
14         Retrieve n^Q s.t. hasMeasure(n_1, n^Q_1) and edge(n^Q_1, n^Q);
           // then one integrates the metric values attached to the quality node:
15         foreach attribute a of n^Q do
16             Qt[a] ← Agg(Qt[a], n^Q.a);   // n^Q.a is the value of a for n^Q
       // 3- Computation of the quality scores
17     µ_pref := computePref(Qt, Voc);
18     L_µ := filter(Qt, Voc);
19     report (M, µ_pref, L_µ);
20 .. Rest of the initial SubgraphSearch subroutine ..
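Steps 1 and 2 of Algorithm 2 can be sketched in Python as follows. The sketch relies on simplifying assumptions that are not in the paper: quality measures are given as plain dictionaries instead of being reached through hasMeasure edges, the measures attached to attributes are assumed to be already merged with those of their node, and each quality subgraph is described by the set of query variables it contains. All names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Scores = Dict[str, float]   # metric name -> quality score


def quality_vectors(mapping: Dict[str, str],
                    q_edges: Dict[str, Tuple[str, str]],
                    subgraphs: Dict[str, Set[str]],
                    metrics: List[str],
                    node_measures: Dict[str, Scores],
                    edge_measures: Dict[Tuple[str, str], Scores],
                    agg: Callable[[float, float], float] = min
                    ) -> Dict[str, Scores]:
    """Instantiate one quality vector Qt(g_i) per quality subgraph g_i."""
    # 1- Initialization: every metric gets the default score 1.
    qt = {g: {m: 1.0 for m in metrics} for g in subgraphs}

    def integrate(variable: str, measures: Scores) -> None:
        # Aggregate the measures into the vectors of the subgraphs
        # that contain the given query variable.
        for g, members in subgraphs.items():
            if variable in members:
                for metric, value in measures.items():
                    qt[g][metric] = agg(qt[g].get(metric, 1.0), value)

    # 2.1- Measures attached to the nodes of the answer.
    for u, n in mapping.items():
        integrate(u, node_measures.get(n, {}))
    # 2.2- Measures attached to the edges of the answer.
    for r, (u1, u2) in q_edges.items():
        integrate(r, edge_measures.get((mapping[u1], mapping[u2]), {}))
    return qt


# Answer g1 of the running example: only the acted_in edge between
# Bill Paxton and Apollo 13 carries a measure (mcompRole = 0.3).
qt = quality_vectors(
    mapping={"tom": "Tom Hanks", "m": "Apollo 13", "p2": "Bill Paxton"},
    q_edges={"r": ("p2", "m")},
    subgraphs={"m": {"m"}, "p2": {"p2"}, "r": {"r"}},
    metrics=["maccTitle", "maccRel", "mcompRole"],
    node_measures={},
    edge_measures={("Bill Paxton", "Apollo 13"): {"mcompRole": 0.3}},
)
print(qt["r"]["mcompRole"])  # 0.3
```

The resulting vectors are the input of the computePref and filter routines of lines 17 and 18, which aggregate them along the quality vocabulary.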


Taking every retrieved answer into account, the cost of the score computation is in:

$$O\Big( \sum_{(g, \mu_{pref}, L_\mu) \in [[Q]]_G} \underbrace{|g_{qt}|}_{\text{for each quality subgraph in the query}} \cdot \underbrace{(|N_g| + |E_g|) \cdot |Metrics(Voc)|}_{\text{cost of browsing the subgraph answer in order to retrieve the attached quality values}} \Big)$$

where $N_g$ (resp. $E_g$) is the set of nodes (resp. edges) in the answer $g$ and $g_{qt}$ are the quality subgraphs that appear in $P_{pref}$ and $L_{aware}$. Assuming that the number of quality subgraphs that appear in a query is smaller than the size of the query pattern, and that the width of the taxonomy is smaller than $|G|$ (which are genuinely reasonable assumptions), this cost is less than $O(|[[Q]]_G| \cdot |G|^2)$. In the case of a CRPQ, this additional cost is largely dominated by the cost of the evaluation without quality awareness, which is in $|G|^{O(|Q|)}$. If the considered pattern is a 2RPQ then this cost seems reasonable w.r.t. a non-quality-aware evaluation, which is in $O(|G| \cdot |Q|)$ [32].

6. Implementation Issues

In order to concretely illustrate the previously defined framework, an extension of the cypher query language [10], available in the Neo4j graph database management system [11], has been implemented. For carrying out such an implementation, two issues have to be considered: the implementation of the data model, and the implementation of the extended query language, named qualypher (including of course the query processing part).

Data Model. The data model is rather easy to implement in the system as an extended data graph keeps the form of an attributed graph. In order to dissociate the data from the related quality information, the elements of the quality extension, referred to as $V_Q$ and $R_Q$ in the definition (and drawn as dashed elements in the figures), have a quality type QT (that does not belong to the types associated with data, in order to avoid ambiguity between the data and the quality annotations). The quality vocabulary is a tree, so it can also easily be modeled as a graph in the system. Then, the data that are stored in the system, named DB in the following, are composed of the data graph itself, the quality vocabulary, and the quality information, all these data being modeled as a graph, as depicted in Figure 10.

Figure 10: Data stored in the graph database management system (the graph database DB contains the data graph G, the quality information attached to it through hasMeasure, and the quality vocabulary Voc)

Extension of cypher. The implementation of the pattern query notion (syntax and semantics) results in an extension of the cypher query language, called qualypher, that integrates the quality-aware features previously introduced. The extension amounts to adding three new clauses to the language. First, a clause DEFINE serves to specify the fuzzy terms involved in the graph pattern [21] or in the quality-related preference condition. Then, the three elements of a quality-aware query $(P, P_{pref}, L_{aware})$ have to be specified. The classical cypher MATCH/WHERE clauses make it possible to define the pattern $P$. Two new clauses are then introduced. The clause QTPREF serves to define the quality preference element $P_{pref}$. Finally, a clause QTAWARE allows specifying the list $L_{aware}$ of quality measures of interest. Query 1 hereafter is the expression in qualypher of the quality-aware query $Q_{Tom}$ from the running example.

Query 1: A qualypher query
DEFINEASC high AS (0.6, 0.8)
MATCH (tom)-[:acted_in]->(m), (m)<-[r:acted_in]-(p2)
WHERE tom.name = 'Tom Hanks'
QTPREF accuracy(m) IS high
QTAWARE completeness(p2), completeness(r)
RETURN tom, m, p2

Remark 4. In this paper, we focus on fuzzy preferences concerning quality. However, it is also possible to consider fuzzy preferences on the data themselves (which would then correspond to fuzzy conditions appearing in the WHERE clause). This aspect was studied in detail in [21]. When both types of fuzzy conditions are present in a query, the two satisfaction degrees attached to an answer should not be aggregated in order to compute a global satisfaction degree, which could raise interpretation difficulties. The two satisfaction degrees should be kept separate (see Remark 3).

Query-Processing. Implementation-wise, two architectures may be thought of when it comes to processing qualypher queries. A first one consists in implementing a specific quality-aware query evaluation engine. The advantage of this solution is that optimization techniques that are implemented directly in the engine should make query processing very efficient. The downside is that quality-aware queries would then be impossible to evaluate by an independent engine that is not equipped with the quality-aware functionalities. The second solution, which is the one used here, consists in using a (possibly remote) classical engine, combined with a dedicated add-on layer. The implementation relies on a query-rewriting derivation mechanism, carried out as a pre-processing and a post-processing step.

Figure 11: Query processing: software architecture (the client sends a quality-aware query $Q_{qt\_aware}$ to the quality add-on layer; its compiling module derives a cypher query $Q_{derived}$, which is evaluated over DB by the classical Neo4j query evaluation engine; its quality score calculation module turns the answers of $Q_{derived}$ into the answers of $Q_{qt\_aware}$)

The add-on layer is composed of two modules. A compiling module transforms the graph pattern qualypher query $Q_{qt\_aware}$ into a classical cypher query $Q_{derived}$ aimed at retrieving all the needed information from the user profile, the quality vocabulary and of course the data themselves. This derived query is then sent to the (classical) Neo4j engine. Based on the answers to $Q_{derived}$, a quality score calculation module computes the quality scores associated with the answers to $Q_{qt\_aware}$ (note that this calculation cannot be integrated in the derived query, due to the limited expressiveness of the language in that respect; a calculator module is therefore necessary).

The result of Query 1, through the Neo4j system extended by the quality-aware extension, would then take the form of the table given in Figure 12. This implementation acts as a proof of concept that simply shows the feasibility of the proposed approach to the modeling of quality metadata in attributed graphs and their use within fuzzy queries.

| tom | m | p2 | r | accuracy(m) is high | completeness(p2) | completeness(r) |
|---|---|---|---|---|---|---|
| {name: "Tom Hanks", born: 1955, bornAt: "Concord,US"} | {name: "Apollo13", released: 1995} | {name: "Bill Paxton", born: 1955} | {name: "acted_in"} | 1 | 1 | 0.3 |
| {name: "Tom Hanks", born: 1955, bornAt: "Concord,US"} | {name: "The De Vinci Code", released: 2008} | {name: "Audrey Tautou"} | {name: "acted_in", role: "Sophie Neveu"} | 0.25 | 0.5 | 1 |

Figure 12: Answer of Query 1
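The pre-processing step of the add-on layer can be illustrated by a small Python sketch that separates the quality-related clauses of a qualypher query from its classical cypher part. This is a deliberately simplified assumption about the compiling module: the function names are hypothetical, one clause per line is assumed, and the part of the derived query that retrieves the quality information itself is omitted.

```python
import re
from typing import Dict, List, Tuple

QUALITY_CLAUSES = ("DEFINEASC", "QTPREF", "QTAWARE")


def split_qualypher(query: str) -> Tuple[str, Dict[str, str]]:
    """Separate the quality clauses from the classical cypher clauses.

    Assumes one clause per line, each line starting with its keyword.
    """
    cypher_lines: List[str] = []
    quality: Dict[str, str] = {}
    for line in query.strip().splitlines():
        keyword, _, rest = line.strip().partition(" ")
        if keyword in QUALITY_CLAUSES:
            quality[keyword] = rest.strip()
        else:
            cypher_lines.append(line.strip())
    return "\n".join(cypher_lines), quality


def parse_define(text: str) -> Tuple[str, float, float]:
    """Parse 'high AS (0.6, 0.8)' into (name, delta, gamma)."""
    m = re.fullmatch(r"(\w+)\s+AS\s+\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", text)
    if m is None:
        raise ValueError(f"unexpected fuzzy term definition: {text!r}")
    return m.group(1), float(m.group(2)), float(m.group(3))


def parse_terms(text: str) -> List[Tuple[str, str]]:
    """Parse 'completeness(p2), completeness(r)' into (term, variable) pairs."""
    return re.findall(r"(\w+)\s*\(\s*(\w+)\s*\)", text)


QUERY_1 = """
DEFINEASC high AS (0.6, 0.8)
MATCH (tom)-[:acted_in]->(m), (m)<-[r:acted_in]-(p2)
WHERE tom.name = 'Tom Hanks'
QTPREF accuracy(m) IS high
QTAWARE completeness(p2), completeness(r)
RETURN tom, m, p2
"""

cypher, quality = split_qualypher(QUERY_1)
print(cypher)
print(parse_define(quality["DEFINEASC"]))   # ('high', 0.6, 0.8)
print(parse_terms(quality["QTAWARE"]))      # [('completeness', 'p2'), ('completeness', 'r')]
```

The extracted fuzzy term and quality elements are what the post-processing step needs in order to compute, for each row returned by the classical engine, the three score columns of Figure 12.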

7. Conclusion and perspectives

This paper proposed a framework that allows introducing fuzzy quality preferences into queries for graph databases. First, theoretical foundations are defined, concerning the data model, and then the query processing. Concerning the data model, an extension of the graph data model is defined. It makes it possible (i) to express quality information and (ii) to attach quality scores to data. Then the notion of a quality-aware query extends the notion of a graph pattern query, by adding quality preferences to the query. Its semantics are defined according to the extended data model. Second, the evaluation process of a quality-aware query is studied (i) by extending a generic state-of-the-art algorithm for introducing quality awareness computation at query evaluation time, and (ii) by exhibiting the evaluation cost associated with this extension, in terms of theoretical complexity. Finally, the concrete application of the framework is discussed through its implementation on top of the Neo4j database management system. We proposed an extension of the cypher query language, and a software architecture, as a first step toward the implementation of the framework.

This work opens many research perspectives, concerning both the theoretical contribution (the modeling of quality information) and the implementation-related one.

Unknown values. In the current framework, 1 is the default value when no assessment is available on a given quality metric or dimension. This means that, when the quality information is missing (unknown), the best quality score is given by default. In practice, unknown values should trigger a special treatment that alerts the user to the imprecision of the quality score resulting from a computation over incomplete quality information. The question of how to deal with unknown values (also known as null values in databases) still remains to be studied in this context.

Fuzzy data model. The quality-awareness extension should also be studied for the more general data model of fuzzy data graphs, where a degree is attached to edges in order to express the "intensity" of a gradual relationship (e.g., likes, is friends with, is about). The question would then be how to combine quality information scores with the satisfaction degree of an answer, which also depends on the intensity of the relationships in data [20].

Fuzzy quantified quality-aware queries. Fuzzy quantified queries have been thoroughly studied in a relational database context, see e.g. [36], where they serve to express conditions about data values. In a graph database context, a new dimension can be exploited that concerns the structure of the graph, as proposed in [37, 21]. The question of dealing with quality preferences expressed through fuzzy quantified statements is also a perspective. One could then express quality-aware queries of the form "retrieve the films released in 2009 such that (the information about) most of the actors that played in this film is highly accurate", where most of is a fuzzy quantifier.

Modeling quality. The literature proposes many conceptual models for modeling quality (see the surveys of [3, 38]). Such models are very complete, and go beyond the classification of the quality metrics according to quality dimensions, by including complex provenance information concerning the quality values (measurement methods, date, tools, certification, actors, etc.). The vocabulary that is used for defining the quality could be extended in order to model such information, which would then raise the issue of offering a user-friendly quality-aware query language for using it.

Multi-criteria decision making. Multi-Criteria Decision Making (MCDM) is the process that aims to find, for a given problem, the best alternatives among all the possible ones, in the presence of multiple (possibly conflicting) criteria. The approach proposed here can be seen, from the MCDM point of view, as a simple MCDM method based on a single formula that combines the quality preferences (which are the multiple criteria), in order to rank the answers retrieved by the interpretation of a query. Some methods proposed by the MCDM community could be considered in order to enrich the approach, especially for making it easier for the users to define their quality criteria. Fuzzy set theory, used here for expressing quality criteria, could be replaced by its intuitionistic extension so as to allow users to express vagueness in their quality criteria [39]. Conflicting, collaboratively defined criteria could also be studied, as in [40, 41, 42], which use Hesitant Fuzzy Linguistic theory in order to combine several possible linguistic definitions associated with a linguistic pattern.

Use cases. Not much work has been done so far in the context of data quality management in attributed graphs. Real datasets that could serve as a basis for relevant benchmark studies are still missing in the literature. While the tractability of the solution is somewhat predictable, its usability from the user's point of view (based on user feedback) is a crucial issue that has yet to be tackled.

References

[1] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer, Quality assessment for linked data: A survey, Semantic Web 7 (1) (2016) 63–93.
[2] M. J. Eppler, M. Helfert, Classification and analysis of data quality costs, in: Proceedings of the International Conference on Information Quality (IQ), 2004.
[3] C. Batini, M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques, Data-Centric Systems and Applications, Springer, 2016.
[4] C. Batini, C. Cappiello, C. Francalanci, A. Maurino, Methodologies for data quality assessment and improvement, ACM Computing Surveys 41 (3) (2009) 16:1–16:52.
[5] R. Angles, C. Gutierrez, Survey of graph database models, ACM Computing Surveys 40 (1) (2008) 1–39.
[6] R. Giugno, D. Shasha, Graphgrep: A fast and universal method for querying graphs, in: Proceedings of the International Conference on Pattern Recognition (ICPR), 2002, pp. 112–115.
[7] D. Dominguez-Sal, P. Urbón-Bayes, A. Giménez-Vañó, S. Gómez-Villamor, N. Martínez-Bazan, J.-L. Larriba-Pey, Survey of Graph Database Performance on the HPC Scalable Graph Analysis Benchmark, in: Proceedings of WAIM'10 Workshops, 2010, pp. 37–48.
[8] C. Vicknair, M. Macias, Z. Zhao, X. Nan, Y. Chen, D. Wilkins, A comparison of a graph database and a relational database: a data provenance perspective, in: Proceedings of the ACM Southeast Regional Conf., 2010, p. 42.
[9] R. Angles, A comparison of current graph database models, in: Proceedings of IEEE International Conference on Data Engineering (ICDE) Workshops, 2012, pp. 171–177.
[10] Neo Technology, The Neo4j Manual v2.0.0, part III (2013).
[11] Neo4j web site, www.neo4j.org.
[12] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer, Quality assessment for linked data: A survey, Semantic Web 7 (1) (2016) 63–93.
[13] F. Radulovic, N. Mihindukulasooriya, R. García-Castro, A. Gómez-Pérez, A comprehensive quality model for linked data, Semantic Web 9 (1) (2018) 3–24.
[14] J. Debattista, M. Dekkers, C. Guéret, D. Lee, N. Mihindukulasooriya, A. Zaveri, Data on the web best practices: Data quality vocabulary, W3C Working Group (2016).
[15] P. Rigaux, V. Thion, Quality Awareness over Graph Pattern Queries, in: Proceedings of the International Database Engineering & Applications Symposium (IDEAS), 2017.
[16] P. N. Mendes, H. Mühleisen, C. Bizer, Sieve: linked data quality assessment and fusion, in: Proceedings of the Joint EDBT/ICDT Workshops, Citeseer, 2012, pp. 116–123.
[17] M. Zneika, D. Vodislav, D. Kotzinos, Quality metrics for RDF graph summarization, Semantic Web (Preprint) (2019) 1–30.
[18] C. Bizer, R. Cyganiak, Quality-driven information filtering using the WIQA policy framework, Journal of Web Semantics 7 (1) (2009) 1–10.
[19] J. Debattista, S. Auer, C. Lange, Luzzu – a framework for linked data quality assessment, in: Proceedings of the IEEE International Conference on Semantic Computing (ICSC), 2016, pp. 124–131.
[20] O. Pivert, V. Thion, H. Jaudoin, G. Smits, On a Fuzzy Algebra for Querying Graph Databases, in: Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2014, pp. 748–755.
[21] O. Pivert, O. Slama, V. Thion, Fuzzy Quantified Structural Queries to Fuzzy Graph Databases, in: Proceedings of the International Conference on Scalable Uncertainty Management (SUM), 2016.
[22] A. Castelltort, A. Laurent, Fuzzy queries over nosql graph databases: perspectives for extending the cypher language, in: International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Springer, 2014, pp. 384–395.
[23] AllegroGraph web site, franz.com/agraph/allegrograph.
[24] InfiniteGraph web site, www.objectivity.com/infinitegraph.
[25] Sparksee web site, sparsity-technologies.com.
[26] O. Pivert, O. Slama, V. Thion, SPARQL Extensions with Preferences: a Survey, in: Proceedings of the ACM Symposium on Applied Computing, 2016.
[27] A. Castelltort, A. Laurent, O. Pivert, O. Slama, V. Thion, Fuzzy Preference Queries to NoSQL Graph Databases, in: O. Pivert (Ed.), NoSQL Data Models – Trends and Challenges, Chapter 6, 2018.
[28] L. A. Zadeh, Fuzzy sets, Information and Control 8 (3) (1965) 338–353.
[29] D. Dubois, H. Prade, Fundamentals of fuzzy sets, Vol. 7 of The Handbooks of Fuzzy Sets, Kluwer Academic, 2000.
[30] A. Castelltort, A. Laurent, Fuzzy Queries over NoSQL Graph Databases: Perspectives for Extending the Cypher Language, in: Proceedings of the Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), 2014, pp. 384–395.
[31] T. C. Redman, Data Quality for the Information Age, Artech House Inc., 1996.
[32] P. Barceló Baeza, Querying graph databases, in: Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2013, pp. 175–188.
[33] P. Barceló, L. Libkin, J. L. Reutter, Querying regular graph patterns, Journal of the ACM 61 (1) (2014) 8:1–8:54.
[34] M. Y. Vardi, On the complexity of bounded-variable queries, in: Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 1995, pp. 266–276.
[35] J. Lee, W.-S. Han, R. Kasperovics, J.-H. Lee, An in-depth comparison of subgraph isomorphism algorithms in graph databases, Proceedings of the Very Large Database Endowment (PVLDB) 6 (2) (2012) 133–144.
[36] P. Bosc, L. Liétard, O. Pivert, Evaluation of flexible queries: the quantified statement case, in: Proceedings of the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), 2000, pp. 1115–1122.
[37] R. Yager, Social network database querying based on computing with words, in: Flexible Approaches in Data, Information and Knowledge Management, Studies in Computational Intelligence, Springer, 2013.
[38] F. Radulovic, N. Mihindukulasooriya, R. García-Castro, A. Gómez-Pérez, A comprehensive quality model for linked data, Semantic Web 9 (1) (2018) 3–24.
[39] B. Daneshvar Rouyendegh (Babek Erdebilli), The intuitionistic fuzzy ELECTRE model, International Journal of Management Science and Engineering Management 13 (2) (2018) 139–145.
[40] Z. Wu, J. Xu, Possibility distribution-based approach for MAGDM with hesitant fuzzy linguistic information, IEEE Trans. Cybernetics 46 (3) (2016) 694–705.
[41] H. Liao, X. Wu, X. Liang, J. Xu, F. Herrera, A new hesitant fuzzy linguistic ORESTE method for hybrid multicriteria decision making, IEEE Trans. Fuzzy Systems 26 (6) (2018) 3793–3807.
[42] Z. Wu, B. Jin, J. Xu, Local feedback strategy for consensus building with probability-hesitant fuzzy preference relations, Appl. Soft Comput. 67 (2018) 691–705.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: