A uniform methodology for extracting type conflicts and subscheme similarities from heterogeneous databases

A uniform methodology for extracting type conflicts and subscheme similarities from heterogeneous databases

InformationSystems Vol. 25, No. 8, pp. 527-552,200O 0 2001 Elsevier Science Ltd. All rights reserved printed in Great Britain 0306-4379/00 $20.00 PB: ...

2MB Sizes 0 Downloads 44 Views

InformationSystems Vol. 25, No. 8, pp. 527-552,200O 0 2001 Elsevier Science Ltd. All rights reserved printed in Great Britain 0306-4379/00 $20.00 PB: s0306-4379(00)00034-x

Pergamon

A UNIFORM METHODOLOGY FOR EXTRACTING TYPE CONFLICTS AND SUBSCHEME SIMILARITIES FROM HETEROGENEOUS DATABASES GIORGIO Dipartimento

di Elettronica,

Informatica

and DOMENICO

TERRACINA e Sistemistica,

Universita

URSINO

degli Studi della Calabria,

Via Pietro Bucci,

87036 Rende (CS), Italy (Received 29 April 1999; in final revised form 6 October 2000) Abstract Cooperative Information Systems have been proposed to allow a uniform access to heterogeneous data yet preserving their operational autonomy. They use global dictionaries defined on the basis of interscheme properties; these include nominal and structural properties, type conflicts and object cluster similarities. Whereas in the literature a certain number of techniques has been proposed for deriving nominal and structural properties, few approaches exist for detecting type conflicts and object cluster similarities. The type of an object indicates if it is an entity, a relationship or an attribute; type conflicts indicate the existence of objects representing the same concept yet having different types. Object cluster similarities denote similitudes between portions of different schemes. This paper proposes an automatic, probabilistic approach to the detection of type conflicts and object cluster similarities in database schemes. The method we are proposing here is based on considering pairs of objects having different types (resp., pairs of clusters), belonging to different schemes and on measuring their similarity. To this purpose object (resp., cluster) structures as well ss obiect (resp.. cluster) neighborhoods are analyzed-to verify similitudes and differences. A number of examples show; the suitability of our techniques to effectively detect type conflicts and object cluster similarities. 0 2001 Elsevier Science Ltd. All rights reserved Key words: Type Conflict Detection in Database Schemes, Derivation of Similarities between Database Subschemes, Automatic and Semantic Approaches for Detecting Interscheme (i.e., Nominal and Structural) Properties from Heterogeneous Databases

1. INTRODUCTION In the last years, an enormous increase of data available in electronic form has been witnessed, as well as a corresponding proliferation of query languages, data models and systems for data management. Nowadays, heterogeneous data management systems often coexist within the same operational environment. Traditional approaches to data management do not seem to guarantee, in these cases, the needed level of access transparency to stored data while preserving the autonomy of local databases. This situation contributed to push the development of new approaches which led to the design of Cooperative Information Systems (CIS) [14, 17, 25, 271 and Data Warehouses (DW) [ll, 261. CIS and DW indeed allow for users to query pm-existing autonomous data sources in a way that guarantees model, language and location transparency. A main problem to be solved in designing such access systems relies in scheme integration, i.e., the activity by which different input database schemes are merged into a unique global structure describing the whole information set available for the query purposes. For very large systems, however, scheme integration alone typically ends up with producing a too complex global scheme that may, in fact, fail to supply a satisfactory and convenient description of available data. In these cases, scheme integration steps must be completed by executing some scheme abstraction steps [l]. Carrying out scheme abstraction amounts to clustering objects belonging to a scheme into homogeneous subsets and producing an abstracted scheme obtained by substituting each subset S with one single object representing S. In order for scheme integration and abstraction to be correctly carried out, the designer has to clearly understand the semantics of involved database schemes. In such system re-engineering problems, the design emphasis is on the integration of preexisting information components, where a key problem is that of deriving relations holding among objects in preexisting schemes [l]. Therefore, methodologies are needed to extract properties from schemes. The most interesting of these are inter-scheme properties that relate objects belonging to different schemes [l, 2, 4, 5, 6, 7, 8, 9, 10, 12, 15, 16, 18, 19, 23, 241. These can be classified into: 527

GIORGIO TERRACINA and DOMENICO URSINO

528

l

Nominal: these are synonymies (i.e., two objects have the same meaning and the same typet), homonymies (i.e., two objects have the same name and the same type but different meanings) and hypernymies (i.e., one object has a more specific meaning than another one of the same type).

.

Structural: these denote

.

Type Conflicts: a type conflict between two objects indicates that they represent the same concept, yet having different types (e.g., one is an entity and the other is a relationship, one is an attribute and the other is an entity, and so on). The existence of a type conflict between two objects denotes that they are two different representations of the same portion of the reality; such conflicts derive from the fact that different designers may have different perceptions of the same reality.

.

Object Cluster Similarities: an object cluster represents a set of connected scheme, i.e., a subscheme; an object cluster similarity denotes the similitude portions of two different schemes.

inclusions

of an object

into another

one.

objects in a between two

In the literature a certain number of approaches for extracting nominal and structural properties has been proposed [4, 5, 6, 8, 12, 15, 22, 241; on the contrary, there is a small number of proposed approaches for discovering type conflicts [15, 21, 241 and object cluster similarities [15, 20, 241. When the amount of involved data is huge and/or when they change quite frequently over time, manual approaches to derive interscheme properties do not suffice and the necessity arises of automatic techniques. Indeed, automatic approaches are much less time consuming than manual ones and so a try-and-check methodology for the extraction of interscheme properties can be successfully adopted. Finally, updating and maintaining dictionaries obtained by an automatic extraction process is simpler. In the literature only some of the proposed approaches for deriving interscheme properties are automatic; most of them are, in fact, manual. In this paper we propose a novel automatic technique for detecting type conflicts and object cluster similarities in database schemes defined by E/R diagrams. The approach is based on computing a similarity degree between pairs of objects 01 E S1 and 02 E 5’2 having different types (for type conflict detection) or pairs of object clusters 6’1 E Sl and C2 E S2 (for object cluster similarity detection). Similarity degrees are expressed using values in the real interval [0,11. Type conflicts and object cluster similarities are detected in the first place as probabilistic properties of scheme objects. Then, only those characterized by a sufficiently high similarity degree are returned as actual detected conflicts. In order to derive similarity degrees associated to type conflicts (object cluster similarities, resp.), for each scheme object (cluster, resp.) our algorithm considers its structure and its context 18, 221. The structure of an object consists of its attributes whereas the structure of a cluster consists of the objects belonging to it. The context of an object 0 (resp., of a cluster C) indicates the objects belonging to the neighborhood (see below) of 0 (resp., of C). Most of the automatic approaches proposed in the literature for deriving interscheme properties are based on comparing structures (i.e., object attributes); only few of them consider also object contexts. Our approach exploits structures and contexts in deriving similarity degree because we argue that (i) objects having the same real world semantics are very often characterized by the presence of common elements in their context; (ii) it is generally accepted that similarity techniques derived taking into account contexts are more precise than techniques based only on the attribute analysis [6, 8, 221. Finally, for attributes we have to take into account their semantic relevance, i.e., the relevance in distinguishing the semantics of entities/relationships they belong to [12]. For example the attribute Salary is more relevant than the attribute Identifier for distinguishing the entity Employee from the entity Car. The approach presented in this paper uses some weights and thresholds; we have exploited several example cases for tuning up their optimal values and for validating the approach itself after ‘The type of an object object types are sometime

0 in an E/R scheme indicates if 0 is an entity, called meta-types in the literature.

a relationship

or an attribute.

Scheme

A Uniform Methodology

for Extracting

Type Conflicts and Subscheme Similarities

529

the tuning phase. The most interesting and complete test case has been the set of database schemes relative to Italian Central Governmental Offices (ICGO). ICGO own about 300 databases many of which having a complex structure and containing large amounts of data. First a tuning phase has been carried out for fixing weights and thresholds; then the algorithms have been applied and their results have been validated. To this purpose we have compared results produced by our approach with those obtained manually by “Italian Information Systems Authority for Public Administration” (AIPA); the comparison has demonstrated that our algorithms are capable of producing high quality results yet requiring a short amount of time. In more detail, we generally obtained the same results as those derived manually by AIPA; however AIPA took about three years for examining all ICGO databases and, after that period, some of the databases into consideration had been modified, other databases had been created and so on; therefore manual derivation of interscheme properties appeared to be not able to face ICGO database changings. Our approach could require several weeks for examining all ICGO databases; therefore, on the one hand, it appears appropriate for handling a great number of (possibly) large databases; on the other hand, it takes into account scheme semantics and obtained results are satisfactory and very similar to those derived by manual techniques. The description of one of the experiments we have carried out on ICGO databases can be found in Section 4. We point out that the approach described in this paper for type conflicts and object cluster similarities follows the same leading ideas of [22]. In particular, in [22] an algorithm is presented for deriving nominal properties. The algorithm consists of four steps. The first one derives the so called rough synonymies between entities, obtained taking into account object structures only. The second step computes synonymies between relationships using their context, rough synonymies derived during the previous step and lexicographic attribute synonymies. More refined values for entity synonymies, as well as entity homonymies, are searched for by exploiting context information that can be set up through rough synonymies between entities and synonymies between relationships obtained in the previous two steps. The fourth step is a filtering phase which discards synonymies whose plausibility is under a certain threshold. As a consequence, in the whole, we propose a

Definition 1 Let A be an attribute. The structure of A consists of A itself. The context of A consists of both the object 0, which A belongs to, and the structure of 0 (except, obviously, A itself), Let E be an entity. The structure of E consists of the set of its attributes. The context of E consists of relationships it takes part into along with their attributes and all other entities linked to these relationships along with their attributes. Let R be a relationship. The internal structure of R consists of its attributes. The structure of R consists of its internal structure plus the entities linked by R and their attributes. The context of R consists of entities linked (through some relationship) to entities belonging to the structure of R. In the sequel, we shall use the following support functions: EStructure(E) (resp., RStructure(R)), that takes in input an entity E (resp., a relationship R) and yields in output objects belonging to its structure; AContezt(A) (resp., (i) EContezt(E), (ii) RContezt(R)) that receives an attribute

GIORGIOTERRACINAand DOMENICOURSINO

530

A (resp., context. 2.2.

(e) an entity E, (ii) a relationship

General

R) and returns the set of objects

belonging to its

Characteristics

In this section we illustrate an algorithm for the derivation of type conflicts. Each type conflict will be denoted by a triplet of the form [A, B, fl, where A and B are the involved objects and f is a fuzzy coefficient, in the real interval [0, l], which expresses the plausibility of the property. The proposed algorithm activates some functions, one for each kind of type conflicts. In particular these functions are: (i) EA_E-Ccmf, which derives conflicts between entity attributes and entities; (ii) RAE_Conf, which derives conflicts between relationship attributes and entities; (iii) EA-R-Con f, which detects conflicts between entity attributes and relationships; (iv) RA_R_Ccmf, which determines conflicts between relationship attributes and relationships; (v) ER-Cmf, which detects conflicts between entities and relationships. For each object into consideration, the algorithm examines both its structure (taking also into account the semantic relevance of attributes forming it) and its context. The derivation of each kind of type conflicts generally consists of three steps, each of which is implemented by a suitable function. The first one derives the so-called rough type conflicts, taking into account only the structure of the involved objects. In the second step refined type conflicts are derived taking into account both rough type conflicts, detected in the previous step, and object contexts. Finally, the third step singles out the most interesting among the discovered type conflicts. The proposed algorithm exploits some basic background knowledge about involved databases consisting in: l

A Lexicographic Synonymy Property Dictionary LSPD, storing triplets of the form [A, B, f], where A and B are the involved names and f is a fuzzy coefficient, in the real interval [0,11, expressing the plausibility of the property. Constructing the LSPD consists of two steps: - stating lexical synonymies either by using a standard thesaurus (such as Wordnet) or by enquiring with experts of the domain; _ setting plausibility coefficients; since standard thesauri do not provide plausibilities synonymies, in this step, the support of domain experts is mandatory.

for

In order to obtain a high objectivity, groups of human experts can be asked to supply plausibility coefficients and mean values are assumed. In particular, human experts are required to specify plausibilities choosing among five possible values, namely H, MH, M, ML, L, to denote a High, Medium High, Medium, Medium Low and a Low plausibility, resp. These specifications correspond to values 1.0, 0.8, 0.6, 0.4 and 0.2, resp., of the corresponding plausibility coefficients. In real application settings, however, the possibility exists for errors to occur in plausibility values associated to lexicographic similarities provided by experts, although we are entitled to assume that possible errors do not involve most coefficients. By conducting a sensitivity analysis based on varying the values specified for lexicographic similarities, we have shown that, under the above mentioned condition, the results yielded by our techniques are not significantly influenced by ill-specified plausibility coefficients occurring in the LSPD. l

A Synonymy Property Dictionary SPD, B are involved objects and f is a fuzzy the plausibility of the property. In the extracting synonymy properties and for 20, 22, 241.

The main algorithm for extracting

storing triplets of the form [A, B, f ] , where A and coefficient, in the real interval [0, 11,that expresses literature some algorithms have been proposed for constructing a Synonymy Dictionary [6, 8, 12, 15,

type conflicts is as follows:

A Uniform Methodology for Extracting Type Conflicts and Subscheme Similarities

Algorithm

for Extracting

Type

531

Conflicts

Input: it list S of n database schemes; a dictionary SPD of synonymy properties and a dictionary LSPD of lexicographic synonymy properties; Output: a dictionary TCD of type conflicts; begin

TCD := EA_E_Conf

(S, SPD, LSPD)

EA_R_Conf

(S, SPD, LSPD)

ER_Conf(S,

SPD, LSPD)

u RAE_Conf(S,

U RARXonf

SPD, LSPD)

(S, SPD, LSPD)

u

u

end 2.3. Discovering Conflicts between Entity Attributes and Entities The function EA_E_Conf derives conflicts between entity attributes and entities. This derivation is obtained into two separate phases: the first phase determines rough type conflicts by taking into account only object structures, whereas the second phase extracts refined type conflicts by using both rough type conflicts and object contexts. In more detail the function is the following:

Function EA_E_Conf (S: a list of n database schemes; SPD: a Synonymy Property Dictionary; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of type conflicts between entity attributes and entities; var RTC:

a set of (rough) type conflicts between entity attributes and entities;

TTC:

a set of type conflicts between entity attributes and entities;

begin RTC := EA_E_Rough(S,LSPD); TTC := EA_E_Refined(S,SPD,LSPD,RTC); TTC := EA_E_Discard_ Weak(TTC); return

TTC

end Next, we are going to give some details on the procedures listed above. 23.1.

Computing Rough Entity Attribute - Entity Conflicts

The function for computing rough conflicts between entity attributes and entities analyses pairs of distinct schemes. For each pair of objects in the two schemes, and representing an entity attribute and an entity, resp., the function checks in the LSPD for the presence of a lexicographic synonymy between names of both the entity attribute and the entity. The corresponding function is as follows: Function EA_E_Rough (S: a list of n schemes; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of (rough) type conflicts between entity attributes and entities; var RTC: a set of type conflicts between entity attributes and entities; h, k : 1 .. n; begin

GIORGIOTERRACINAand DOMENICOURSINO

532 RTC := 0; for h:= 1 to n do

for k:= 1 to n such that h # k do for each pair consisting

of an entity

Ai E Sh and an entity

attribute

Ei E Sk

do

RTC := RTC U [Ai, Ej, LexSyn(Ai, return

Ej, LSPD)];

RTC

end

The function Lex-Syn checks, in the LSPD, the presence of a lexicographic synonymy between names of an attribute Ai and an entity Ej, which it receives in input, and yields in output its plausibility coefficient, if it holds, 0 otherwise.

2.3.2.

Computing Refined Entity Attribute - Entity Conflicts

This function derives a more refined value for the coefficient associated to each conflict between an entity attribute (say AE) and an entity (say E) by exploiting (i) rough type conflicts derived by the previous procedure, (ii) synonymies between attributes of the contexts of AE and E, (iii) synonymies between entities of the contexts of AE and E. The function is as follows: EA_E_Rejined(S: a list of n schemes; SPD: a Synonymy Property Dictionary; LSPD: a Lexicographic Synonymy Property Dictionary; RTC: a set of (rough) type conflicts between entity attributes and entities): return a set of type conflicts between entity attributes and entities; Function

VZll-

A, Q: a set of reals in the interval

[0,11;

h, k : 1.n; TTC:

a set of type conflicts

between

entity

attributes

and entities;

begin

TTC := 8; for h := 1 to n do for k := 1 to n such that h # k do for each pair consisting

of an entity

attribute

Ai E S,, and an entity

Ej E Sk

do begin

AA~E~ := EA_E_Att_Syn(Ai, QA;E, := EAEEnt_Syn(Ai,

Ej, SPD);

Let [Ai, Ej, ~A,E, 1 be a tuple

belonging

TTC := TTCU end; return end

Ej, LSPD);

TTC

to RTC;

[Ai, Ej, EA_ERefined_Val(AAiEj,Q~,Ej,

fA,E,)j

A Uniform Methodology for Extracting Type Conflicts and Subscheme Similarities

533

Function EAEAtt-Syn The function EA_E_Att_Syn computes synonymies between attributes belonging to the context of an attribute Ai and the context of an entity Ej, which it receives in input. This function is as follows: Function EA_E_Att_Syn(Ai: an entity graphic Synonymy Property Dictionary):

attribute; Ej: an entity; return a real E [O, 11;

LSPD:

a Lexico-

const w, = 1; var SP: a set of attribute

synonymies;

L, M : a set of attributes; F: a matrix

of reals;

begin SP := 0; L := { Al 1 Al is an attribute M := { A,

1A,

and Al E AContezt(Ai)

and A,,, E ECmtext(Ej)

is an attribute

Ai E L and A,

for each pair of attributes SP := SP U (AI, A,,

};

W_Mean(AI,

};

E M do

A,, LSPD));

SP := Obj_Discard(SP); F := Mat(L, M, SP); return Matching(L,

M, F, u,)

end The function W-Mean (already introduced in [22]) computes the synonymy coefficient associated to a pair of attributes (Al, A,) taken in input. To this end, three characteristics are considered, namely, attribute name, domain, and key characterization; the key characterization of an attribute tells if the attribute is a primary key (PK), a secondary key (SK) or not a key (NK) for the entity it belongs to. The function returns the result of the following expression: WN

x NA,A,

+

WD

x DA,A,

+

WK

x KA~A,,,

where WN, WD and WK are weighting factors, whose optimal values have been experimentally set to O&O.15 and 0.05, respectively; NA,A, is the value of the lexicographic synonymy between Ai LSPD; the value of DA,A, is set to 1 if the domains of and Am, as resulting from the dictionary AI and A, are the same, to 0.5 if they are compatible (default compatibilities are integers with reals and chars with strings; further, more specific, compatibilities can be provided by the human expert) and to 0 otherwise. Finally, K,+A, is set to 1 if Al and A, are both PK or both SK or both NK, to 0.5 if one of them is PK and the other is SK or if one of them is SK and the other is NK, to 0 if one of them is PK and the other is NK. The function Obj_Discard considers each tuple belonging to SP and normalizes the corresponding plausibility value, by setting it to 0 if this was previously under a certain given threshold th,. The optimal value of th, has been experimentally set to 0.25 (see the Introduction and Section 4 for the description of these experiments). The function Mat creates a matrix F having a row for each Al E L, a column for each A, E M and F[Al, A,] equal to the plausibility value associated in SP to the pair Al and A,. The function Matching computes a factor obtained from calculating a maximum weight matching. The input here are two sets of objects L = (11,. . . ,l,} and M = {ml,. . . , m,} and a cost matrix F on L and M such that, for each la E L and mj E M, 0.0 5 F[l,,mj] 5 1.0. The output is a value v in the real interval [0,11. If L = 0 or M = 0 then Matching returns TV= 0. Otherwise, let BG = (L U M, A) be a bipartite weighted graph, where A is the set of weighted edges

534

GIORCIOTERRACINAand DOMENICOURSINO

{ (Zi, mj, fij) ) fij > 0); the maximum weight matching for BG is a set A’ C A of edges such that for each node x E LUM there is at most one edge of A’ incident onto x and d(A’) = &;,mj,fijJEA, fij Now, let s(A’)

is maximum+.

= $$.

Th e value TJreturned

by Matching is defined

abs(lLI-IM1)+2x(min(II;I,JM()-IA’lr lLl+lMI

where w, is a constant has been experimentally Function

used to weight the importance set to 1.

of unrelated

>

as:

x ;i;W

edges.

The optimal

value of wV

EA_EXnt_Syn

The function EA_E_Ent_Syn takes in input an attribute Ai, an entity Ej and a dictionary SPD of synonymy properties and returns a synonymy coeflicient as the maximum of plausibility coefficients relative to all entities related to the entity Ej and the entity which Ai belongs to. More specifically the function is the following: Function

Property

EA_E_Ent_Syn(Ai: an entity attribute; Dictionary): return a real E [0,11;

Ej: an entity,

SPD:

a Synonymy

var

SP: a set of entity

synonymies;

begin SP := 0; Let El be the entity for each entity

in the AContext(

E,,, E EContext(Ej)

if LEl, Em, ~E[E,J E SPD

do

then

SP := SP u (El, Em, LYE,) else

SP := SP U (El, E,, 0); return

Max(SP);

end

The function Max takes in input the set SP of synonymies the maximum of all plausibility values. Function

between

objects

and yields in output

EAERefined_Val

The function EA_E_Refined_Val takes in input the real numbers AA;E,, QA;E, and f&E, and computes their weighted mean with the coefficients VA = 0.3, VQ = 0.25 and Vf = 0.45. 2.3.3.

Discarding Weak Type Conflicts

EA_E_Discard_ Weak takes in input a set TTC of (candidate) type conflicts bety;;oentity attributes and entities. It computes an interest threshold: th,, = max (6, TTCMAx MuIL), where TTCMAX and TTCM~N are the maximum and the minimum plausibility val& in TTC and 6 is a real number E [O, l] whose value has been experimentally set to 0.6. The function then discards all type conflicts whose coefficient is smaller than th,,. As already stated, all type conflicts not discarded by EA_E_Discard_ Weak are added to the Type Conflict Dictionary TCD.

A Uniform Methodology

for Extracting

Type Conflicts and Subscheme Similarities

535

33

(1.1)

Idmtitia

Duration

warranty

Fig. 1: Scheme AD: the Administration

2.3.4.

Department

Database

An Example

Consider the schemes in Figures 2 and 1 representing, respectively, Administration AD) and Production (denoted PD) departments of an organization. Assume the following tuples belong to LSPD: [Id, SSN, 0.81 [BirthDate, Date, 0.91 [Name, Name, 11 [Sala y, Stipend, 0.91 [Birthplace, District, 0.451 (Reference-Area, Reference-Section,

(denoted

[Id, Name, 0.1 [Name, SSN, 0.1 [Home, Address, 0.91 [Birthplace, Zip Code, 0.4 [BirthPlace, Town, 0.81 0.81

We show how the function EAE_Conf works for schemes given in the example. Due to space limitations we consider only the pair consisting of the attribute Birthplace of scheme AD and the entity Town of scheme PD. The function EA_ERough inserts the tuple [Birthplace,

Town, LexSyn(BirthPlace,

Tmn,

LSPD)

= 0.81

into RTC. Consider now EAERefined. It begins with calling EAEAtt_Syn. When this is activated it constructs: L := {Id, BirthDate, Name, Home, Salary} M := {SSN, Name, Address, Function, Stipend, Date} For each pair of attributes Al E L and A, E M, their plausibility coefficient is derived by applying the function W-Mean and the corresponding tuple is inserted into SP. Tuples constituting SP are. therefore: tA survey on maximum weight matching can be found in [13].

GIORGIO TERRACINAand DOMENICOURSINO

536

Cod.5

Zip-Code NUIK Rcsident_Number District

NZUM

Reference_Area Occupied_Suface

Fig. 2: Scheme PD: the Production

(Id, SSN, 0.84) (Id, Function, 0.15) (Birthdate, SSN, 0) (BirthDate, Function, 0.05) (Name, SSN, 0.74) (Name, Function, 0.18) (Home, SSN, 0.15) (Home, Function, 0.2) (Salary, SSN, 0) (Salary, Function, 0.05)

Department

(Id, Name, 0.74) (Id, Stipend, 0) (BirthDate, Name, 0.03) (BirthDate, Stipend, 0.05) (Name, Name, 1) (Name, Stipend, 0.03) (Home, Name, 0.18) {Home, Stipend, 0.05) (Salary, Name, 0.03) (Salary, Stipend, 0.92)

Database

(Id, Address, 0.15) (Id, Date, 0) (BirthDate, Address, 0.05) (BirthDate, Date, 0.92) (Name, Address, 0.18) (Name, Date, 0.03) (Home, Address, 0.92) (Home, Date, 0.05) (Salary, Address, 0.05) (Salary, Date, 0.05)

The function Obj-Discard considers each tuple in SP; for those having less than 0.25 it modifies the plausibility coefficient setting it to 0. The function Mat yields in output the following matrix:

F=

i

0.84 0 0.74 0 0

0.74 0 1 0 0

0 0 000 0 0 0.92 0 0 0

0 0 0 0.92

= w

= y

= 0.92

coefficient

0 0.92 0 0 0

The function Matching computes the maximum weight matching: A’ = {(Id, SSN, 0.84), (BirthDate, Date, 0.92), (Name, Name, (Salary, Stipend, 0.92)) $(A’)

a plausibility

l), (Home,

Address,

0.92),

v = (1 - $ x i”i) x 0.92 = 0.88

Then, the function EA-E-Ent-Syn is activated. The entity belonging is Employee; the entity belonging to EContext(Town) is Subordinate.

to ilContext(BirthPlace) Any algorithm computing

A Uniform Methodology for Extracting Type Conflicts and Subscheme Similarities

537

entity synonymies finds a synonymy to hold between these entities. For example the algorithm in [22] returns a synonymy with a plausibility coefficient equal to 0.57; suppose this synonymy is present, in SPD; when the function EA_E_EntSyn is called, (Subordinate, Employee,0.57) is stored in SP; consequently Max(SP) = 0.57 and this value is returned by the function. Therefore EA_ERefined_Val returns 0.3 x 0.88 + 0.25 x 0.57 + 0.45 x 0.8 = 0.77 and the tuple [Birthplace, Town, 0.771 is stored in the TTC. The function EA_EDiscard_Weak considers this property valid and stores the corresponding tuple in the Type Conflict Dictionary TCD. 2.4. Discovering

Conflicts between Entities and Relationships

The function ER_Conf be encoded as follows:

derives conflicts between entities and relationships. The function can

Function E_R_Conf (S: a list of n database schemes; SPD: a Synonymy Property Dictionary; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of type conflicts between entities and relationships; VaF

RTC:

a set of (rough) type conflicts between entities and relationships;

TTC:

a set, of type conflicts between entities and relationships;

begin RTC := ERRough(S,

LSPD);

TTC := E_RRe f ined(S, SPD, RTC); TTC := E_RDiscaTd_Weak(TTC); return TTC end The three procedures called by ER_Conf 24.1.

Computing Rough Entity-Relationship

are illustrated in the following subsections. Conflicts

The function E-R-Rough proceedes by analysing pairs consisting of an entity and a relationship, belonging to different schemes and, for each pair, computes a coefficient stating the synonymy degree of the involved objects. To this purpose, it considers all possible pairs of attributes, one belonging to the entity structure and the other to the relationship structure and, for each pair, computes the associated similarity degree. The function is encoded as follows: Function E_R_Rough(S: a list of n database schemes; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of (rough) type conflicts between entities and relationships; const

SP: a set of synonymies between attributes; RTC:

a set of type conflicts between entities and relationships;

h, k : 1 . . n; L, M: a set of attributes; F: a matrix of reals; begin

GIORGIOTERRAGINAand DOMENICOURSINO

538

RTC := 0; for h:= 1 to n do for k:= 1 to n such that h # k do for each pair consisting

of an entity

Ei E Sh and a relationship

Rj E Sk do

begin SP :=

0;

L := { AI 1AI is an attribute M := { A,

1A,

and Ai E EStructwe(Ei)

is an attribute

for each pair of attributes SP := SP U (Al, A,,

and A,

Al E L and A, W_Mean(Al,

E RStructwe(Rj)

}; };

E M do

A,, LSPD,w,));

SP := ObjDiscard(SP); SP := Att_Relevance(SP); F := Mat(L, M, SP); RTC I= RTC U [Ei, Ri, Matching(L,

M, F, w,)]

end; RTC := Normalize(RTC); return RTC end Functions W-Mean, Mat and Matching are the same as those in EA_E_Att_Syn. The function Att_Relevance takes in input a set SP of synonymies between attributes and performs a normalization of corresponding plausibility coefficients taking into account the semantic relevance of the involved attributes, i.e., their relevance in distinguishing the semantics of objects they belong tot [12]. The leading idea is that, in the definition of the object semantics, primary key attributes are less influencing than candidate non-primary key attributes that are in turn less influencing than non-key attributes. More in particular, for each tuple (AI, A,, ~,J,A,,,) E SP, the function multiplies ~A,A,,, by a suitable coefficient CA,A, which takes into account the semantic relevance of the attribute pair. Values for CA,A, have been obtained experimentally; in particular CA~A, is equal to: (i) 0.4, if both Al and A, are primary keys; (ii) 0.6, if an attribute is a primary key and the other is a non-primary key; (iii) 0.65, if an attribute is a primary key and the other is not a key; (iv) 0.8, if both Al and A, are non-primary keys; (v) 0.9, if an attribute is a non-primary key and the other is not a key; (vi) 1, if both Al and A, are not keys. The function Normalize takes in input a set RTC of type conflicts between entities and relationships and performs a final normalization on the set of coefficients provided as its input. Indeed, because of the computation performed by the function Att_Relevance, the synonymy coefficients we have computed for type conflicts are underestimated. Therefore the function Normalize sets the plausibility coefficient f of each entry of RTC to N x f, where N is a suitable normalization coefficient. Due to space limitations we cannot describe here how N is obtained.

24.2. Computing Refined Entity Relationship Conflicts This function is devoted to derive more refined entity-relationship the algorithm exploits both previously derived rough type conflicts texts. The function is as follows:

type conflicts. To this end, and synonymies between con-

Function E-R-Refined (S: a list of n database schemes; SPD: a Synonymy Property Dictionary; RTC: a set of (rough) type conflicts between entities and relationships): return a set of type conflicts between entities and relationships; tFor example, as said in the Introduction, the attribute distinguishing the entity Employee from the entity Car.

Salary is more relevant than the attribute

Identifier

for

A Uniform Methodology

for Extracting

Fig. 3: Two Different Representations

A,Q: a set of reals in the interval

539

Type Conflicts and Subscheme Similarities

of Marriages

[0, 11;

h,k : l..n; TTC:

a set of type conflicts

between

entities

and relationships;

begin TTC

:= 0;

for h:= 1 to n do for k:= 1 to n such that h # k do for each pair consisting

of an entity

Rj E Sk do

h’i E Sh and a relationship

begin QE;R~ := E_R_Ent_Syn(Ei,

Rj, SPD);

Let [Ei, Rj, fEiRj 1 be a tuple TTC

belonging

to RTC;

:= TTC U [Ei, Rj, E_R_Refined_Val(fE;R,,

QE;R~)~

end; return TTC end The function being that: -L:={El(E’

EREnt-Syn I 1s an entity

- M := { E,,, 1E, The function

to the function

and El E EContext(Ei)

is an entity

E-R-Refined-Val

2.4.3. Discarding

is analogous

and E,

is analogous

EAE-Att-Syn,

the only differences

};

E RCmtext(Rj)

}

to EA_E_Refined_Val

but now Vf = 0.45 and VQ = 0.55.

Weak Type Conflicts

The function E-R-Discard- Weak takes in input a set TTC of type conflicts between relationships; it is analogous to EA_E_Discard_ Weak; in particular the threshold th,, set to 0.6. to max (y, TTCMax~TTC~rN ) , where y h as b een experimentally

entities and is set equal

2.4.4. An Example As an illustrating example of our methodology, consider taken from [24]): Assume that the following tuples belong to LSPD: [Date, Date, l]

the schemes shown in Figure 3 (example

[Date, BirthDate,

0.91

We show how the function E_R_Conf works for schemes given in the example. We consider only the entity EMarriage and the relationship RMarriage. Consider the function E-R-Rough. Here: M := {Date, Name, BirthDate} L := {Date, Location, Contract} For each pair of attributes Al E L and A, E M the plausibility coefficient is derived by applying the function W-Mean, and the corresponding tuple is inserted into SP. Tuples included in SP are, therefore:

GIORGIO TERRACINA and DOMENICO URSINO

540

(Date, Date, 1) (Location, (Contract,

The application

Date, 0.05) Date, 0.05)

of functions

(Date,

(Location, (Contract, Obj-Discard,

Att_Relevance

1 ( The activation

of the function

Matching

(Location, (Contract,

Name, 0.09) Name, 0)

F=

[EMarriage,

(Date, BirthDate,

0)

Name,

0

0.05) 0.05)

and Mat yields in output:

0.92

00

0

00

0

returns

0.92)

BirthDate, BirthDate,

1

(1 - 0.5 x f) x 1 = 0.67 and the tuple

RMarriage,

0.671 is stored in the RTC. Next we apply the function E-R-Refined on EMarriage and RMarriage. the function E_R_Ent_Syn is invoked; the function determines:

At the beginning,

121 := {Person}

L := {Person}

The application of the procedure for detecting entity synonymies described in [22] yields in output [Person, Person, O.SS]; therefore the function returns QE,unrringe Rh,arringe = 0.66; now the function E-RRefined-Val is activated which returns 0.45 x 0.67+0.55 x 0.66 = 0.66; therefore the tuple [EMarriage, RMarriage, 0.661 is stored in the TTC. Finally the function E-R-Discard- Weak judges that there is a conflict between Ehlarriage and RMarriage; therefore the corresponding tuple is returned as a result by E-R-Conf and is added to the Type Conflict Dictionary. 2.5.

Discovering

Other

Type Conflicts

Discovering

conflicts between relationship attributes and entit,ies is carried out by the function this is very similar to EA-E-Conf, the only relevant difference is that it considers the structure and the context of a relationship attribute instead of the structure and the context of an entity attribute. Discovering conflicts between entity attributes and relationships is carried out by the function EA-R-Conf; this is analogous to EA-E-Conf, except that it considers relat,ionship structure and context instead of entity structure and context. Discovering conflicts between relationship attributes and relationships is carried out by the function RA-R-Conf; this is analogous to RA_E_Conf, the only relevant difference is that it considers the structure and the context, of the relationship instead of the structure and t~he context of an entity. RA-E-Conf;

3. EXTRACTION 3.1.

OF SIMILARITIES

AMONG

OBJECT

CLUSTERS

Preliminaries

In the previous sections we have shown algorithms for recognizing type conflicts between objects. However there are many situations where a certain portion of the world representation is incorporated as two different sets of objects (hereafter called object clusters) within two schemes. In order to single out such similarities we have designed a second group of algorithms which are presented in this section. Preliminarly consider the following definitions: Definition 2 An object cluster is either (i) a relationship cluster. or (ii) a subset cluster, or (iii) a composed cluster. A relationship cluster is a portion of an E/R scheme consisting of a relationship, the entities it links and all involved attributes. A subset cluster is a portion of an E/R scheme consisting of an entity. the entities linked to it by an is-a relationship and all involved attributes. A composed cluster is a cluster including in turn two object clusters sharing at least, one entity.

A Uniform Methodology

for Extracting

Type Conflicts and Subscheme Similarities

541

Definition 3 Given an object cluster C, the structure of C consists of objects (attributes, entities and relationships) belonging to C; the context of C consists of the difference between the union of the cluster’s object contexts and objects belonging to the cluster itself (and therefore belonging to the cluster’s structure). In the sequel, we will use functions CStructure(C) and CContext(C), which take in input a cluster C and return the objects belonging to its structure and to its context, resp. For each cluster, our algorithm takes into account both its structure and its context. The algorithm exploits a Dictionary of Lexicographic Synonymy Properties LSPD and a Dictionary of Synonymy Properties SPD (see Section 2.2 for details about these dictionaries). It can be encoded as follows: Algorithm

for extracting object cluster similarities

Input: a list 5’ of n database schemes; a dictionary SPD dictionary LSPD of lexicographic synonymy properties;

of synonymy

Output:

clusters;

OCSD

a dictionary

of similarities

between

object

properties;

a

var ROCS:

a set of (rough)

OCPS:

a set of (interesting)

similarities

between

pairs of object

object

clusters;

clusters;

begin OCPS

:= Select-Interesting_Clusters(S,

ROCS

:= OC_Rough(S,

OCSD

:= OC-Ref

OCSD

:= OC_Discard_Weak(OCSD);

OCPS,

ined(S,

SPD,

OCPS,

SPD); LSPD);

SPD,

ROCS);

return OCSD end In the following 3.2.

Selecting

sections

Interesting

we will describe

into detail the functions

used in the algorithm.

Clusters

The function for selecting interesting clusters is devoted to limit the (potentially exponential) number of possible clusters by determining the most interesting, i.e., the most promising ones. An object cluster can be considered as the minimum subscheme containing two “seed” entities, called seed-entl and seed-ent2. The term “minimum subscheme” indicates the subscheme containing the minimum number of objects, among those comprising both seed-entl and seed_entZ If more than one subscheme exists with the same (minimum) number of objects, that having the maximum number of objects involved in synonymies stored in the SPD is chosen. If more than one exists with this characteristic, that having the greatest plausibility coefficients, associated to the synonymies which the objects belonging to it are involved in, is taken. For each pair of schemes Sh and Sk belonging to S, the function determines all pairs of object clusters (Ci, Cj) such that Ci E Sh, Cj E Sk and both Ci and Cj have been judged promising. More in particular, in a first step, object clusters of Sh are those having, as seed_entl, an entity E, involved in a significant entity synonymy and, as seed_ent2, any other entity El of Sh; given an object cluster in Sh having E, and El as the seeds, corresponding object clusters in Sk are those having, as seed_entl, any entity E, related to E, by a significant synonymy and, as seed-entb, any entity E, related to El by a significant synonymy. The second step is dual to the first one. The function can be encoded as follows: Function Select_Interesting_CIusters(S: a list of n database schemes; SPD: a Synonymy Property Dictionary): return a set of (interesting) pairs of object clusters;

GIORGIO TERRACINA and DOMENICO URSINO

542

var

OCPS:

a set of (interesting)

pairs of object

clusters;

begin for each [IS,., E,, ~E,E,J E SPD do begin

Let Sh be the scheme of E, and let Sk be the scheme of E,; for each entity

El E Sh do

for each [El, E,,

~E,E,,,J E SPD such that E,

C, := Minimum_Cluster(E,,

El);

Cj := Mininl7Lm_Cluster(E,,

E,,);

OCPS

E Sk do begin

Z= OCPS U (Ci, Cj)

end; for each entity

Ep E Sk do

for each [Ep, E,, f E,E,J E SPD such that Eq E Sh do begin Ci := Minimum_Cluster(E,,

Eq);

Cj := Minimum_Cluster(E~,

Ep);

OCPS

1= OCPS U (Ci, Cj)

end end end The function Minimum_Cluster takes in input two seed entities Ei and Ej and yields in output the object cluster (i.e., the “minimum subscheme”) which both E, and E:, belong to.

3.3. Extracting Rough Synonymies between Object Clusters The function for computing rough object cluster similarities determines, for each pair of object clusters, a coefficient stating the associated similarity degree; this is a weighted mean of similarity degrees of attributes, entities, relationships and cardinalities relative to the clusters of the pair. The function is as follows: Function OC_Rough(S: a list of n database schemes; OCPS: a set of (interest.ing) object cluster pairs; SPD: a Synonymy Property Dictionary; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of (rough) object cluster similarities; var Q, T, U, V: a set of reals E [0,11;

TOGS:

a set of similarities

between

object

clusters;

begin

TOCS := 0; for each pair of object

clusters

(Ci, Cj) E OCPS

Qc,cj

Z= OC_RoughAtt_Syn(Ci,Cj,

Tc,cj

:= OC_Rough-Ent_Syn(Ci,

Cj, SPD);

uc,c,

:= OC_RoughRel_Syn(C,,

C,, SPD);

Vc;c,

:= OC_Rough_Card_Syn(S,

Ci, Cj);

do begin

LSPD);

TOCS := TOGS U IjCi, Ci, OC_Rough_Val(Qc,c,, end; return end

TOCS

Tc,c,, iJc,c,, I’c,c,)II

A Uniform

Methodology

for Extracting

Type

Conflicts

Functions OCRougkAttSyn, OC_Rough_Ent_Syn and function EA_E_AttSyn, the only differences being that: l

- L := {AI ( AI is an attribute 1A,

OC_Rough_RelSyn

are analogous

543

to the

and Ai E CStructure(Ci)}

is an attribute

and A,,, E CStructure(Cj)J

in OC_Rough_Ent_Syn: - L := {El 1El is an entity - M := {E, - SP:=U

l

Similarities

in OC-Rough_Att_Syn:

- M := {A, l

and Subscheme

1E,

and El E CStructure(Ci)} and E,,, E CStructure(Cj)}

is an entity

LEI,E,.IE,E,JESPD.E,EL.E,EM

in OC-Rough-Rel-Syn: - L := {RI 1RI is a relationship - M := {R, - SP:=U

j R,

and Rl E CStructure(Ci)} and R,

is a relationship

(fb,%n,

LR,.Rm.fR,RmlESPD,R,~L.Rm~M

E CStructure(Cj)} f&R,)

The function OC_Rough_Card_Syn takes in input a list S of schemes and two object clusters Ci and Cj; it considers cardinalities of relationships constituting the structure of clusters and returns a real value E [0,11;the higher the number of relationships of Ci having the same cardinalities of the corresponding synonym relationships of Cj is, the higher the value returned in output is. A simple implementation of the function should return the ratio between the number of synonym relationships having the same cardinality and the total number of synonym relationships. The function OC-Rough-Val takes in input four real coefficients: Qcicj, Tcicj, Ucicj and Vcicj and computes their weighted mean with the coefficients ZQ = 0.5, ZT = 0.3, ZU = 0.15 and 2” = 0.05. 3.4.

Extracting Refined Similarities between Object Clusters

The function for computing refined object cluster similarities is devoted to derive more refined cluster similarities. To this purpose it exploits entity and relationship synonymies, stored in the SPD, and rough object cluster similarities, stored in the ROCS. In more detail the function is as follows: OC_Refined(S: a list of n database schemes; OCPS: a set of (interesting) cluster pairs; SPD: a Synonymy Property Dictionary; ROCS: a set of (rough) cluster similarities): return a set of object cluster similarities;

Function

object object var

T, U: a set of reals E [0,11; TOCS:

a set of similarities

between

object

clusters;

begin

TOCS := 0; for each pair of object

Tcicj

clusters

(Ci, Cj) E OCPS

:= OC_RefinedEnt_Syn(Ci,

Ucicj := OC_RefinedRelSyn(Ci,

do begin

Cj, SPD); Cj, SPD);

Let ]Ci, Cj, fc;c, J be a tuple belonging

to ROCS;

TOCS := TOCS U IlCi,Cj, OC_Refined_VaZ(fc;cj, end; return end

TOCS

Tcicj, Uc;c,)ll

GIORGIO TERRACINA and DOMENICO URSINO

544

The function OC-Refined-Ent-Syn ferences being that: - L := {El 1E 1 1s an entity - M := {E, The function

1E,

is analogous

OC_Rough_Ent_Syn,

to the function

the only dif-

and El E CContext(Ci)} and E,

is an entity

OC-Refined-Rel-Syn

E CContext(C~)}

is analogous

OC_Rough_Rel_Syn,

to the function

the only differ-

ences being that: -

L := {Rl 1Rl is a relationship

- M := {R,

1R,

and RI E CContext(Ci)} and R,

is a relationship

E CContext(C,)}

The function OC-Rejined_Val takes in input three parameters fc,cj, Tc,c, and Uc>cj and computes their weighted mean with the coefficients Zf = 0.65, 2~ = 0.25 and 2~ = 0.1. 3.5.

Discarding

Weak Object

Cluster Sim,ilarities

The function OC_Discard_ Weak takes in input a dictionary OCSD of similarities between object clusters and discards the weak ones; the function is analogous to EA-E-DiscardWeak except that now the threshold th,, is equal to max (CT,OcsDMAx+ocsD~llN ), and CJ has been experimentally set to 0.55. Properties judged interesting by OC_&iscard_ Weak are stored in the final Object Cluster Similarity Dictionary. Some Examples

3.6. 3.6.1.

A First Example

Consider the schemes and the LSPD of Section 2.3.4. Let us apply the algorithm for computing the similarity degree associated to the pair of object clusters Cl = (Employee - Works-Division) and C, = (Subordinate - Operates - Department). At first, the function OC-Rough is called; it activates OC_Rough_Att_Syn which Qclcz = 0.81. After this, functions OC_Rouglb_Ent_Syn, OC_RoughRelSyn and OC_Rough_Card_Syn nally, OC-Rough-V&

returns

are invoked which return Tc,c, = 0.66, Uc1c2 = 0.57 and I/c1c2 = 1. Fireturns 0.5 x 0.81 + 0.3 x 0.66 + 0.15 x 0.57 + 0.05 = 0.75 and the tuple

/lCl, Cz, 0.7511 is inserted in the ROCS. After this, OC-Refined is called; it activates OC_RefinedEnt_Syn, which returns TcIcz = 0.25, and OC-RefinedRel_Syn, which returns Uclcz = 0.4; then, the function OC_Refined_Val is activated which returns 0.59; the tuple l/Cl, C2,0.5911 is added to the OCSD. Finally, the function 3.6.2.

OC-Discard-Weak

judges

the similarity

meaningful.

A Second Example

Consider

the schemes

in Figure

4, taken

from [24].

Ordered - Product) and Cz = (Customer - Places Assume that the LSPD stores the following tuples: [Nwme, Name, By applying the SPD:

the approach [Customer, [Ordered,

l]

[Date, ODate,

presented

Customer, Ordline,

0.851

[Quantity,

in [22], the following

0.72J

0.601

Consider

- Order

the cluster

- Ordline

Qty, 0.951

synonymies

Cl = (Customer

-

- Product).

[P#, P#,

l]

can be found and stored in

[Product,

Product,

[Ordered,

Places,

0.751 0.471

At the beginning, the funct,ion OC-Rough is activated. It, in its turn, calls OC_Rough.Att_Syn, which yields 0.86, OC_RoughEnt_Syn, which returns 0.66, OC-RoughRelSyn, which returns 0.40, and OC_Rough_Card_Syn, which yields 1. Finally, the function OC_Rough_Val returns 0.5 x 0.86 + 0.3 x 0.66 + 0.15 x 0.40 + 0.05 x 1 = 0.74 and the t,uplc Ill:, , Cz. 0.7411 is stored in ROCS. At this point, OCRefined should nevert,heless, extreme cases are:

be called.

In this example

no context

has been specified;

A Uniform Methodology

for Extracting

Type Conflicts and Subscheme Similarities

I

Name-

.

545

--op#

Tc,c, = 0 and UcIcz = 0, in which OC_Refined_Val returns 0.65 x 0.74 + 0 + 0 = 0.48; consequently the OCDiscard_Weak discards the corresponding similarity;

function

case the function

. the two contexts OC_Ref

are very similar so that Tc,c, = 1 and Uclca = 1, in which case the function returns 0.65 x 0.74 + 0.25 x 1+ 0.1 x 1 = 0.83; in this case the object cluster is accepted and stored in the corresponding dictionary by OC_Discard_Weak.

ined_Val

similarity

is able to automatically discover very difficult cluster This example shows that our approach _ _ similarities 1241. Moreover the example shows how our algorithm is sensible to the context (and therefore to the scheme semantics [B]); therefore its choices are not only syntactic but also semantic (see [6, 8, 221). 4. A REAL

CASE EXAMPLE

This section is devoted to show how our approach scales to real example cases. In particular we illustrate here one of the experiments we have carried out on ICGO databases. Database schemes of ICGO have been the main test set that we have used to verify the effectiveness of our techniques in deriving interscheme properties. This set includes heterogeneous databases, both in the data model and languages (e.g., hierarchical, network, relational), and as far as their structure and complexity are concerned (ranging from simple databases with schemes including few objects, to very complex databases). The ICGO databases into consideration are: Banca Dati Legge 185 - Credit0 Agevolato per piccole e medie imprese (Database of Facilitated Credit for small and medium firms), hereafter FCD. The scheme of this database is shown in Figure 5. Banca

Dati Legge 317 - Contributo

of Sunk database

Contributions for small is shown in Figure 6.

a Fondo

and medium

Banca Dati Legge 46 - Richieste Contributions), hereafter FRCD.

Albo delle Imprese

Figure

9.

(Firm Database),

per piccole

firms),

hereafter

e medie imprese SCD.

(Database The scheme of this

di imprese

per contributi (Database of Firm Requests The scheme of this database is shown in Figure 7.

Banca Dati Obiettivi CEE - Contributi

of European Union Contributions is shown in Figure 8.

Perduto

for

Unione Europea per Piccole e Medie Imprese

for small and medium

hereafter

FD.

firms).

(Database The scheme of this database

The scheme of this database

is shown in

GIORGIO TERRACINA and DOMENICO URSINO

546

Fig. 5: Database

of Facilitated

Credit

and Medium

Firms

(FCD)

I

I

Fig. 6: Database

for Small

of Sunk

Contribution

for Small

and Medium

Firms

(SCU)

A Uniform

Methodology

for Extracting

Fig. 7: Database

Fig. 8: Database

of European

of Firms

Type Conflicts

Requests

Union Contributions

and Subscheme

for Contributions

Similarities

(FRCD)

for Small and Medium Firms

(EUCD)

547

and DOMENICO

GIORGIO TERRACINA

548

URSINO

I Teehnicrl Manager

Historical Register

Company

Fig.

First

Obpxt

Producing Factory

9: Firm

Database

Second ActivityIF,,]

[FCD]

Contribution[Fc,pl Machine[FcDl FirmwDl Credit InstitutqFcDl Factorypcq FirmlFcDl Producing ActivityfFc,l Factorypcq Firm[FcD] Firmi.-cnl Factory[scq FiqscDl Producing Activityls,,l Factory,scD1 Firqscr,l FirqscDl Factory[FRcDl FirowRcDl Progra1q,,cLq FiroqFRCD] Firm[Eucnl Table

1: Synonymies

(FD)

Coefficient

Object

Producing Activity[sc,i Factory[scDl Contributionfscol Machinqscol Firm[scp Credit InstitutqscDl Factory[FRcol FirmpRcDl Producing Activity[El,CD] Factowpucq Firm[EucDl

ComPanY[F,] FactorYpRcD] Firm[FRcq

Producing ActivityLEucD] FactorylEucDl Firmwucol CompanylFDl FactorylEucDl Firm(Eucul Program(8acol ComPanYIFD] ComwnYfFDl

Derived

for ICGO

0.98 0.94 0.90

0.87 0.85 0.80 0.80 0.72 0.94 0.93 0.86 0.54 0.78 0.66 0.92 0.90 0.76 0.52 0.80 0.67 0.59 0.47 0.54

Schemes

A Uniform Methodology

y irst Object

Cluster

( Producing

Activity

ff

for Extracting

Carries Out c)

1 ( Producing

A Factory tf Carries Out tt et Prod&g Activity )~FCO~ ( Machine ++ For t-+ Factory t) et Carries Out ff Producing Activity )rpc~, ( Contribution ++ Asks For et Factory )lFc~l ( Contribution t) Asks For ct Factory ct ct Carries Out ct Producing Activity fs t-+ Machine t) For )IFCD, _ ( Machine tt For fs Factory +t tt Carries Out ct Producing Activity tf c) For et Contribution )lFcDl ( Contribution t) Asks For t) A Factory ts Carries Out tt et Producing Activity tt Machine tt

* For )[FCDJ

+t Carries Out ct

I ( Contribution 1 4-kF act orv t)

++ Asks For c1 Carries Out et ++ Prodoling Activity )ISCD~ ( Machine ct For c) Factory c) tf Carries Out ++ Producing Activity )I~c~, ( Contribution ++ Asks For t+ Factory ),sco, ( Machine e, For +t Factory ++ t, Carries Out et Producing Activity et cs For t) Contribution ),SCD, ( Contribution +t Asks For ++ Factory tt cs Machine +-+ For tt t) Carries Out t+ Producing Activity ),SCD, ( Contribution c1 Asks For e, & Factory CI Carries Out c1 ct Producing Activity t+ Machine tt

I 1 * For )[SCD~ I ( Contribution cf

Asks For e, t3 Factory ct Machine tt For cf ct Carries Out ++ Producing Activity

t)

++For )~CDI

( Contribution tt Asks For H t) Factory c) Machine t, For t) ct Carries Out tf Producing Activity ( Machine fs For tf Factory ++ tf Carries Out tt Producing Activity ff For tt Contribution )fFC~, . , ( Contribution tt Asks For c) tt Factory c1 Machine ++ For tf t) Carries Out C) Producing Activity ( Machine c) For fs Factory )LFCDI

1 0.93 I 0.89

I 0.87 0.86 0.85

0.85

0.85

I I

( Contribution c) Asks For ++ tf Factory fs Carries Out ++ t) Producing Activity tt Machine

fs

0.85 ct 0.85 tt

++For )ISCDI I ( Machine +) For ),FCDJ

I

tf Factorv c) ++ Carries Out tf Producing Activity tf tt For N Contribution )lscD, ( Producing Activity ct Carries Out cf tf Factory )~SCD~ Contribution tf Asks For ct tf Factory c) Machine ++ tt For tf Carries Out t) c) Producing Activity )[SCDI l Machine c) For +-t Factorv tt tt Carries Out tf Producing Activity f) tf For tf Contribution )[~col ( Factory tt Has tf Firm )[FRCD, ( Producing Activity e, Carries Out +t ++ Factory )IEUCDI ( Producing Activity ct Carries Out tf t-i Factory cf Has t-t

( Factory tt Has tt Firm )IFRCDI ( Producing Activity t) Carries Out ft * Factory )[EUCOI ( Factory t) Has c) Firm )[E~CDI Cluster

Similarities

Derived

for ICGO

0.85

)[SCD~

i+ For )ISCD~

( Firm tt Asks For ++ Factory )ISCD~ ( Producing Activity f) Carries Out tt ct Factory )lscD1 ( Factory c) Has tt Firm )[FRCD, 2: Object

( Contribution t) Asks For er tt Factory tt Carries Out tt t) Producing Activity tt Machine

)L~CDI

( Contribution C) Asks For C) c) Factory ++ Machine et cs For c* Carries Out c, et Producing Activity )[FCD) l Machine tf For tt Factorv ++ cf Carries Out +-+ Producing Activity cf ct For ct Contribution )~FCDI ( Factory f+ Asks For e, Firm )[FCD, ( Producing Activity t) Carries Out c) et Factory )IFCDI ( Producing Activity tt Carries Out c) et Factorv tt Asks For tt

Table

Activity

549

tt Factory ),scD,

cf Factory ),pcDl ( Contribution tt Asks For ct

( Contribution tt Asks For t) A Factory ++ Carries Out c) tf Producing Activity tt Machine

Type Conflicts and Subscheme Similarities

Schemes

I I

0.85

0.85 0.85

I I

0.85

0.80 0.90 0.85

0.66 0.88 0.76

GIORGIO TERRACINA and DOMENICO URSINO

550

By applying

the approach

presented

in [22], we derive the synonymies

shown in Table

The application of the algorithm for detecting type conflicts presented the presence of a type conflict between the attribute Category of the entity FRCD and the entity Category of the database EUCD. Finally, by applying object cluster similarities

lt.

in Section 2 derives Firm in the database

to the five ICGO schemes illustrated above the algorithm for detecting presented in Section 3 we have obtained the results shown in Table 2t.

5. RELATED

WORKS

In the literature other techniques have been proposed for the construction of global dictionaries describing inclusions, synonymies and other properties between scheme objects (e.g., [8]). Our methodology derives similar properties but also associates to each property a coefficient stating its plausibility. These coefficients allow each application exploiting interscheme properties (such as scheme integration and scheme abstraction) to obtain more reliable results. In the literature there are very few papers dealing with the detection of type conflicts (e.g., [4, 15, 241). In (41 a technique is proposed to solve all kinds of type conflicts, but this paper supposes type conflicts have been detected manually by human experts and, therefore, does not propose any approach for detecting them. On the contrary [24] proposes a technique for detecting type conflicts and object cluster similarities in a semi-automatic fashion; as illustrated also by various examples, techniques presented in this paper are able to detect all kinds of type conflicts derived by techniques described in [24], but in a more automatic fashion. Our approach has some similarities with that presented in [15] but: (i) [15] derives only synonymies, hyponymies and type conflicts (and not object cluster similarities); (ii) it requires a heavier intervention of the human expert since he must validate one “assumed similarity” at each step (whereas our algorithm requires the intervention of the human expert for validating derived properties only after the extraction of all of them); (iii) finally, we take advantage of the background knowledge whereas this is a lack of [15], as denoted by the authors themselves. The approach presented in this paper is complementary to that, described in [23]: indeed [23] improves the description of types by associating a semantics to each of them in order to explain the terms of application domain and their interrelations; these improvements on type description could be exploited for refining the part of our approach which detects type conflicts. In addition the approach for semantic interoperability presented in [23] assumes the existence of a shared ontology for schemes into consideration and admits that its construction is a difficult task; the part of our approach detecting nominal properties between scheme objects (already presented in [22]) can provide this ontology. In [21] a general approach for automatically deriving type conflicts is described based on an abstract fixpoint procedure. Here we provide a concrete technique which takes some ideas from [21] but takes into account also the guidelines of [22], which led to successfully derive nominal and structural properties. In addition we extend the approach of [al, 221 for deriving also object cluster similarities. The final result of this effort consists of the definition of a uniform framework for deriving all most common kinds of interscheme properties. In [20] a methodology for deriving object cluster similarities is proposed. This approach computes the similarity coefficients associated to each object cluster pair by taking into account only the similarities between entities belonging to involved clusters. The approach proposed here is much more precise because, for each cluster pair, it considers not only objects belonging to the structure of involved clusters, but also those belonging to their context. This choice fits the recent developments relative to the theory of interscheme property extraction (see [8, 121) where it was proved that properties derived without considering contexts are not completely reliable and it was stressed the importance of the attribute semantic relevance in the extraction of interscheme properties. tin this table 0~~1 indicates the object 0 of the scheme S. tin this table (C)fS] indicates the cluster C of the scheme S

A Uniform Methodology

for Extracting

Type Conflicts and Subscheme Similarities

551

6. CONCLUSIONS In this paper we have proposed a new automatic approach for deriving type conflicts and object cluster similarities in database schemes. Our method, unlike most of the previous ones, is automatic and, thus, it is well suited for large integration problems where manual approaches are difficult to be applied. In particular the approach described in this paper for type conflicts and object cluster similarities follows the same leading ideas as the methodology presented in [22] for deriving nominal properties. As a consequence, in the whole, we have obtained a uniform semi-automatic approach for deriving nominal properties, type conflicts and object cluster similarities. The method is based on considering pairs of objects having different types (resp., pairs of clusters) belonging to different schemes and on measuring their similarity. To this purpose, object (resp., cluster) structures as well as object (resp., cluster) contexts are analyzed to verify similitudes and differences. Some examples have been proposed in the paper for illustrating the suitability of our approach to effectively detect type conflicts and object cluster similarities. The effectiveness of the proposed approach has been tested over numerous application cases; in particular our main test case has been the set of Italian Central Governmental Office databases; in this paper we have shown the application of our algorithms to five of these databases. Presently we are completing the integration of this approach into a more complex system, called D.I.K.E. (Database Intensional Knowledge Extractor), that we are developing at D.E.I.S. Universita della Calabria, whose overall approach consists of the following steps: l

l

l

the enrichment of scheme description, obtained by (semi-automatically) extracting nominal and structural properties, type conflicts, object cluster similarities and assertions involving complex patterns relative to intensionally defined objects; the exploitation of derived interscheme properties for obtaining, in a semi-automatic fashion, an integrated and abstracted representation of available data encoded in a Data Repository; the exploitation of both the repository and interscheme properties, derived in the previous steps, for supporting the designer in realizing mediator-based Cooperative Information Systems or Data Warehouses over available data.

Acknowledgements The authors gratefully acknowledge the Autorith per I’Informatica nella Pubblica Amministrazione (AIPA) for kindly providing the schemes of ICGO and technical support with them. The authors gratefully thank Luigi Palopoli and D. Sacc?~ for many inspiring discussions about the arguments of the paper. Domenico Ursino is also supported by ISI-CNR.

REFERENCES [l] C. Batini, S. Castano, V. De Antonellis, M.G. Fugini, and B. Pernici. Analysis of an inventory of information Requirement Engineering Journal, l( 1):47-62 (1996). systems in the public administration. [2] C. Batini, S. Castano, M.G. Fugini, P. Naggar, M. Pellizzoni, and B. Pernici. Tecniche per l’analisi di descrizioni di processi nella pubblica amministrazione. In Atti de1 Congress0 Ann&e dell’AICA (AICA ‘g5), pp. 247-258, Cagliari, Italy. (In Italian) (1995). [3] C. Batini, S. Ceri, and S.B. Navathe. Company (1992).

Conceptual

Database

Design.

[4] C. Batini and M. Lenzerini. A methodology for data schema integration Pansactions on Software Engineering, 10(6):650-664 (1984). [5] C. Batini, M. Lenzerini, and S.B. Navathe. A comparative integration. ACM Computing Surveys, 15(4):323-364 (1986).

The Benjamin/Cummings

Publishing

in the entity relationship model. IEEE

analysis of methodologies

for database

[6] M.W. Bright, A.R. Hurson, and S. Pakzad. Automated resolution of semantic heterogeneity ACM tinsactions on Database Systems, 19(2):212-253 (1994). [7] S. Castano and V. De Antonellis. Reference conceptual architecture for re-engineering International Journal of Cooperative Information Systems, 4(2):213-235 (1995).

scheme

in multidatabases.

information

systems.

[8] S. Castano and V. De Antonellis. Semantic dictionary design for database interoperability. In Proc. of Intemational Conference on Data Engineering (ICDE’g7), pp. 43-54, Birmingham, United Kingdom. IEEE Computer Society (1997).

GIORGIO TERRACINA and DOMENICO URSINO

552 [9] S. Castano, applications.

V. De Antonellis, M.G. Fugini, and B. Pernici. Conceptual Technical Report, Politecnico di Milano, pp. 95-160 (1995).

schema analysis:

[lo]

T. Catarci and M. Lenzerini. Representing and using interschema knowledge in cooperative Journal of Intelligent and Cooperative Information Systems, 2(4):375-398 (1993).

[ll]

S. Chaudhuri and U. Dayal. An overview of data warehousing and olap technology. 26(1):65-74 (1997).

[12] P. Fankhauser, M. Kracker, and E.J. Neuhold. RECORD, 20(4):59-63 (1991). [13] Z. Galil. (1986).

Efficient algorithms

Techniques

and

information systems.

ACM SIGMOD

RECORD,

Semantic vs. structural resemblance of classes. ACM SIGMOD

for finding maximum

matching in graphs.

ACM

Computing

Surveys,

18:23-38

D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and [14] H. Garcia-Molina, Y. Papakonstantinou, J. Widom. The tsimmis approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 8:117-132 (1997). [15] W. Gotthard, P.C. Lockemann, and A. Neufeld. System-guided view integration for object-oriented IEEE Transactions on Knowledge and Data Engineering, 4(1):1-22 (1992). [16] J.A. Larson, S.B. Navathe, and R. Elmastri. A theory of attribute equivalence in databases to schema integration. IEEE Transactions on Softvrore Engdneering, 15(4):449-463 (1989).

databases.

with application

[17] A. Levy, A. Rajaraman, and .I. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of 22th International Conference on Very Large Data Bases (VLDB’SG), pp. 251-262, Bombay, India. Morgan Kaufmann (1996). [18] M.V. Mannino and W. Effelsberg. Matching Conference on Data Engineering (ICDE’Sd), Society (1984). [19] S.B. Navathe, R. Elmasri, 19(1):50-62 (1986).

and J.A. Larson.

techniques in global schema design. In Proc. of International pp. 418-425, Los Angeles, California, USA. IEEE Computer Integrating

user views in database

design.

IEEE

Computer,

[ZO] L. Palopoli, D. Sac&, G. Terracina, and D. Ursino. A unified graph-based framework for deriving nominal interscheme properties, type conflicts and object cluster similarities. In Proc. of Fourth IFCIS Conference on Cooperative Information Systems (CoopIS’SS), pp. 34-45, Edinburgh, United Kingdom. IEEE Computer Society (1999). [21] L. Palopoli, D. Sac& and D. Ursino. An automatic technique for detecting type conflicts in database schemes. In Proc. of ACM Conference on Information and Knowledge Management (CIKM’98), pp. 306-313, Bethesda, Maryland, USA. ACM Press (1998). techniques for deriving int,erscheme properties [22] L. Palopoli, D. Sac& and D. IJrsino. Semi-automatic database schemes. Data 0 Knowledge Engineering, 30(4):239%273 (1999). [23] E. Sciore, M. Siegel, and A. Rosenthal. Using semantic values to facilitate interoperability information systems. ACM Transactions on Database Systems, 19(2):254-290 (1994).

from

among heterogeneous

[24] S. Spaccapietra and C. Parent. View integration: A step forward in solving structural conflicts. actions on Knowledge and Data Engineering, 6(2):258-274 (1994).

IEEE

Tmns-

[25] J.D. Ullman. Information integration using logical views. In Proc. of International Conference on Database Theory (ICDT’97), pp. 19-40, Delphi, Greece. Lect,ure Notes in Computer Science, Springer-Verlag (1997). [26] J. Widom. Knowledge

Research problems in data warehousing. In Proc. of International Conference on Information Management (CIKM’98), pp. 25-30, Baltimore, Maryland, USA. ACM Press (1995).

[27] G. Wiederhold. (1992).

Mediators

in the architecture

of future information

systems.

IEEE

Computer,

and

25(3):38-49