InformationSystems Vol. 25, No. 8, pp. 527-552,200O 0 2001 Elsevier Science Ltd. All rights reserved printed in Great Britain 0306-4379/00 $20.00 PB: s0306-4379(00)00034-x
Pergamon
A UNIFORM METHODOLOGY FOR EXTRACTING TYPE CONFLICTS AND SUBSCHEME SIMILARITIES FROM HETEROGENEOUS DATABASES GIORGIO Dipartimento
di Elettronica,
Informatica
and DOMENICO
TERRACINA e Sistemistica,
Universita
URSINO
degli Studi della Calabria,
Via Pietro Bucci,
87036 Rende (CS), Italy (Received 29 April 1999; in final revised form 6 October 2000) Abstract Cooperative Information Systems have been proposed to allow a uniform access to heterogeneous data yet preserving their operational autonomy. They use global dictionaries defined on the basis of interscheme properties; these include nominal and structural properties, type conflicts and object cluster similarities. Whereas in the literature a certain number of techniques has been proposed for deriving nominal and structural properties, few approaches exist for detecting type conflicts and object cluster similarities. The type of an object indicates if it is an entity, a relationship or an attribute; type conflicts indicate the existence of objects representing the same concept yet having different types. Object cluster similarities denote similitudes between portions of different schemes. This paper proposes an automatic, probabilistic approach to the detection of type conflicts and object cluster similarities in database schemes. The method we are proposing here is based on considering pairs of objects having different types (resp., pairs of clusters), belonging to different schemes and on measuring their similarity. To this purpose object (resp., cluster) structures as well ss obiect (resp.. cluster) neighborhoods are analyzed-to verify similitudes and differences. A number of examples show; the suitability of our techniques to effectively detect type conflicts and object cluster similarities. 0 2001 Elsevier Science Ltd. All rights reserved Key words: Type Conflict Detection in Database Schemes, Derivation of Similarities between Database Subschemes, Automatic and Semantic Approaches for Detecting Interscheme (i.e., Nominal and Structural) Properties from Heterogeneous Databases
1. INTRODUCTION In the last years, an enormous increase of data available in electronic form has been witnessed, as well as a corresponding proliferation of query languages, data models and systems for data management. Nowadays, heterogeneous data management systems often coexist within the same operational environment. Traditional approaches to data management do not seem to guarantee, in these cases, the needed level of access transparency to stored data while preserving the autonomy of local databases. This situation contributed to push the development of new approaches which led to the design of Cooperative Information Systems (CIS) [14, 17, 25, 271 and Data Warehouses (DW) [ll, 261. CIS and DW indeed allow for users to query pm-existing autonomous data sources in a way that guarantees model, language and location transparency. A main problem to be solved in designing such access systems relies in scheme integration, i.e., the activity by which different input database schemes are merged into a unique global structure describing the whole information set available for the query purposes. For very large systems, however, scheme integration alone typically ends up with producing a too complex global scheme that may, in fact, fail to supply a satisfactory and convenient description of available data. In these cases, scheme integration steps must be completed by executing some scheme abstraction steps [l]. Carrying out scheme abstraction amounts to clustering objects belonging to a scheme into homogeneous subsets and producing an abstracted scheme obtained by substituting each subset S with one single object representing S. In order for scheme integration and abstraction to be correctly carried out, the designer has to clearly understand the semantics of involved database schemes. In such system re-engineering problems, the design emphasis is on the integration of preexisting information components, where a key problem is that of deriving relations holding among objects in preexisting schemes [l]. Therefore, methodologies are needed to extract properties from schemes. The most interesting of these are inter-scheme properties that relate objects belonging to different schemes [l, 2, 4, 5, 6, 7, 8, 9, 10, 12, 15, 16, 18, 19, 23, 241. These can be classified into: 527
GIORGIO TERRACINA and DOMENICO URSINO
528
l
Nominal: these are synonymies (i.e., two objects have the same meaning and the same typet), homonymies (i.e., two objects have the same name and the same type but different meanings) and hypernymies (i.e., one object has a more specific meaning than another one of the same type).
.
Structural: these denote
.
Type Conflicts: a type conflict between two objects indicates that they represent the same concept, yet having different types (e.g., one is an entity and the other is a relationship, one is an attribute and the other is an entity, and so on). The existence of a type conflict between two objects denotes that they are two different representations of the same portion of the reality; such conflicts derive from the fact that different designers may have different perceptions of the same reality.
.
Object Cluster Similarities: an object cluster represents a set of connected scheme, i.e., a subscheme; an object cluster similarity denotes the similitude portions of two different schemes.
inclusions
of an object
into another
one.
objects in a between two
In the literature a certain number of approaches for extracting nominal and structural properties has been proposed [4, 5, 6, 8, 12, 15, 22, 241; on the contrary, there is a small number of proposed approaches for discovering type conflicts [15, 21, 241 and object cluster similarities [15, 20, 241. When the amount of involved data is huge and/or when they change quite frequently over time, manual approaches to derive interscheme properties do not suffice and the necessity arises of automatic techniques. Indeed, automatic approaches are much less time consuming than manual ones and so a try-and-check methodology for the extraction of interscheme properties can be successfully adopted. Finally, updating and maintaining dictionaries obtained by an automatic extraction process is simpler. In the literature only some of the proposed approaches for deriving interscheme properties are automatic; most of them are, in fact, manual. In this paper we propose a novel automatic technique for detecting type conflicts and object cluster similarities in database schemes defined by E/R diagrams. The approach is based on computing a similarity degree between pairs of objects 01 E S1 and 02 E 5’2 having different types (for type conflict detection) or pairs of object clusters 6’1 E Sl and C2 E S2 (for object cluster similarity detection). Similarity degrees are expressed using values in the real interval [0,11. Type conflicts and object cluster similarities are detected in the first place as probabilistic properties of scheme objects. Then, only those characterized by a sufficiently high similarity degree are returned as actual detected conflicts. In order to derive similarity degrees associated to type conflicts (object cluster similarities, resp.), for each scheme object (cluster, resp.) our algorithm considers its structure and its context 18, 221. The structure of an object consists of its attributes whereas the structure of a cluster consists of the objects belonging to it. The context of an object 0 (resp., of a cluster C) indicates the objects belonging to the neighborhood (see below) of 0 (resp., of C). Most of the automatic approaches proposed in the literature for deriving interscheme properties are based on comparing structures (i.e., object attributes); only few of them consider also object contexts. Our approach exploits structures and contexts in deriving similarity degree because we argue that (i) objects having the same real world semantics are very often characterized by the presence of common elements in their context; (ii) it is generally accepted that similarity techniques derived taking into account contexts are more precise than techniques based only on the attribute analysis [6, 8, 221. Finally, for attributes we have to take into account their semantic relevance, i.e., the relevance in distinguishing the semantics of entities/relationships they belong to [12]. For example the attribute Salary is more relevant than the attribute Identifier for distinguishing the entity Employee from the entity Car. The approach presented in this paper uses some weights and thresholds; we have exploited several example cases for tuning up their optimal values and for validating the approach itself after ‘The type of an object object types are sometime
0 in an E/R scheme indicates if 0 is an entity, called meta-types in the literature.
a relationship
or an attribute.
Scheme
A Uniform Methodology
for Extracting
Type Conflicts and Subscheme Similarities
529
the tuning phase. The most interesting and complete test case has been the set of database schemes relative to Italian Central Governmental Offices (ICGO). ICGO own about 300 databases many of which having a complex structure and containing large amounts of data. First a tuning phase has been carried out for fixing weights and thresholds; then the algorithms have been applied and their results have been validated. To this purpose we have compared results produced by our approach with those obtained manually by “Italian Information Systems Authority for Public Administration” (AIPA); the comparison has demonstrated that our algorithms are capable of producing high quality results yet requiring a short amount of time. In more detail, we generally obtained the same results as those derived manually by AIPA; however AIPA took about three years for examining all ICGO databases and, after that period, some of the databases into consideration had been modified, other databases had been created and so on; therefore manual derivation of interscheme properties appeared to be not able to face ICGO database changings. Our approach could require several weeks for examining all ICGO databases; therefore, on the one hand, it appears appropriate for handling a great number of (possibly) large databases; on the other hand, it takes into account scheme semantics and obtained results are satisfactory and very similar to those derived by manual techniques. The description of one of the experiments we have carried out on ICGO databases can be found in Section 4. We point out that the approach described in this paper for type conflicts and object cluster similarities follows the same leading ideas of [22]. In particular, in [22] an algorithm is presented for deriving nominal properties. The algorithm consists of four steps. The first one derives the so called rough synonymies between entities, obtained taking into account object structures only. The second step computes synonymies between relationships using their context, rough synonymies derived during the previous step and lexicographic attribute synonymies. More refined values for entity synonymies, as well as entity homonymies, are searched for by exploiting context information that can be set up through rough synonymies between entities and synonymies between relationships obtained in the previous two steps. The fourth step is a filtering phase which discards synonymies whose plausibility is under a certain threshold. As a consequence, in the whole, we propose a
Definition 1 Let A be an attribute. The structure of A consists of A itself. The context of A consists of both the object 0, which A belongs to, and the structure of 0 (except, obviously, A itself), Let E be an entity. The structure of E consists of the set of its attributes. The context of E consists of relationships it takes part into along with their attributes and all other entities linked to these relationships along with their attributes. Let R be a relationship. The internal structure of R consists of its attributes. The structure of R consists of its internal structure plus the entities linked by R and their attributes. The context of R consists of entities linked (through some relationship) to entities belonging to the structure of R. In the sequel, we shall use the following support functions: EStructure(E) (resp., RStructure(R)), that takes in input an entity E (resp., a relationship R) and yields in output objects belonging to its structure; AContezt(A) (resp., (i) EContezt(E), (ii) RContezt(R)) that receives an attribute
GIORGIOTERRACINAand DOMENICOURSINO
530
A (resp., context. 2.2.
(e) an entity E, (ii) a relationship
General
R) and returns the set of objects
belonging to its
Characteristics
In this section we illustrate an algorithm for the derivation of type conflicts. Each type conflict will be denoted by a triplet of the form [A, B, fl, where A and B are the involved objects and f is a fuzzy coefficient, in the real interval [0, l], which expresses the plausibility of the property. The proposed algorithm activates some functions, one for each kind of type conflicts. In particular these functions are: (i) EA_E-Ccmf, which derives conflicts between entity attributes and entities; (ii) RAE_Conf, which derives conflicts between relationship attributes and entities; (iii) EA-R-Con f, which detects conflicts between entity attributes and relationships; (iv) RA_R_Ccmf, which determines conflicts between relationship attributes and relationships; (v) ER-Cmf, which detects conflicts between entities and relationships. For each object into consideration, the algorithm examines both its structure (taking also into account the semantic relevance of attributes forming it) and its context. The derivation of each kind of type conflicts generally consists of three steps, each of which is implemented by a suitable function. The first one derives the so-called rough type conflicts, taking into account only the structure of the involved objects. In the second step refined type conflicts are derived taking into account both rough type conflicts, detected in the previous step, and object contexts. Finally, the third step singles out the most interesting among the discovered type conflicts. The proposed algorithm exploits some basic background knowledge about involved databases consisting in: l
A Lexicographic Synonymy Property Dictionary LSPD, storing triplets of the form [A, B, f], where A and B are the involved names and f is a fuzzy coefficient, in the real interval [0,11, expressing the plausibility of the property. Constructing the LSPD consists of two steps: - stating lexical synonymies either by using a standard thesaurus (such as Wordnet) or by enquiring with experts of the domain; _ setting plausibility coefficients; since standard thesauri do not provide plausibilities synonymies, in this step, the support of domain experts is mandatory.
for
In order to obtain a high objectivity, groups of human experts can be asked to supply plausibility coefficients and mean values are assumed. In particular, human experts are required to specify plausibilities choosing among five possible values, namely H, MH, M, ML, L, to denote a High, Medium High, Medium, Medium Low and a Low plausibility, resp. These specifications correspond to values 1.0, 0.8, 0.6, 0.4 and 0.2, resp., of the corresponding plausibility coefficients. In real application settings, however, the possibility exists for errors to occur in plausibility values associated to lexicographic similarities provided by experts, although we are entitled to assume that possible errors do not involve most coefficients. By conducting a sensitivity analysis based on varying the values specified for lexicographic similarities, we have shown that, under the above mentioned condition, the results yielded by our techniques are not significantly influenced by ill-specified plausibility coefficients occurring in the LSPD. l
A Synonymy Property Dictionary SPD, B are involved objects and f is a fuzzy the plausibility of the property. In the extracting synonymy properties and for 20, 22, 241.
The main algorithm for extracting
storing triplets of the form [A, B, f ] , where A and coefficient, in the real interval [0, 11,that expresses literature some algorithms have been proposed for constructing a Synonymy Dictionary [6, 8, 12, 15,
type conflicts is as follows:
A Uniform Methodology for Extracting Type Conflicts and Subscheme Similarities
Algorithm
for Extracting
Type
531
Conflicts
Input: it list S of n database schemes; a dictionary SPD of synonymy properties and a dictionary LSPD of lexicographic synonymy properties; Output: a dictionary TCD of type conflicts; begin
TCD := EA_E_Conf
(S, SPD, LSPD)
EA_R_Conf
(S, SPD, LSPD)
ER_Conf(S,
SPD, LSPD)
u RAE_Conf(S,
U RARXonf
SPD, LSPD)
(S, SPD, LSPD)
u
u
end 2.3. Discovering Conflicts between Entity Attributes and Entities The function EA_E_Conf derives conflicts between entity attributes and entities. This derivation is obtained into two separate phases: the first phase determines rough type conflicts by taking into account only object structures, whereas the second phase extracts refined type conflicts by using both rough type conflicts and object contexts. In more detail the function is the following:
Function EA_E_Conf (S: a list of n database schemes; SPD: a Synonymy Property Dictionary; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of type conflicts between entity attributes and entities; var RTC:
a set of (rough) type conflicts between entity attributes and entities;
TTC:
a set of type conflicts between entity attributes and entities;
begin RTC := EA_E_Rough(S,LSPD); TTC := EA_E_Refined(S,SPD,LSPD,RTC); TTC := EA_E_Discard_ Weak(TTC); return
TTC
end Next, we are going to give some details on the procedures listed above. 23.1.
Computing Rough Entity Attribute - Entity Conflicts
The function for computing rough conflicts between entity attributes and entities analyses pairs of distinct schemes. For each pair of objects in the two schemes, and representing an entity attribute and an entity, resp., the function checks in the LSPD for the presence of a lexicographic synonymy between names of both the entity attribute and the entity. The corresponding function is as follows: Function EA_E_Rough (S: a list of n schemes; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of (rough) type conflicts between entity attributes and entities; var RTC: a set of type conflicts between entity attributes and entities; h, k : 1 .. n; begin
GIORGIOTERRACINAand DOMENICOURSINO
532 RTC := 0; for h:= 1 to n do
for k:= 1 to n such that h # k do for each pair consisting
of an entity
Ai E Sh and an entity
attribute
Ei E Sk
do
RTC := RTC U [Ai, Ej, LexSyn(Ai, return
Ej, LSPD)];
RTC
end
The function Lex-Syn checks, in the LSPD, the presence of a lexicographic synonymy between names of an attribute Ai and an entity Ej, which it receives in input, and yields in output its plausibility coefficient, if it holds, 0 otherwise.
2.3.2.
Computing Refined Entity Attribute - Entity Conflicts
This function derives a more refined value for the coefficient associated to each conflict between an entity attribute (say AE) and an entity (say E) by exploiting (i) rough type conflicts derived by the previous procedure, (ii) synonymies between attributes of the contexts of AE and E, (iii) synonymies between entities of the contexts of AE and E. The function is as follows: EA_E_Rejined(S: a list of n schemes; SPD: a Synonymy Property Dictionary; LSPD: a Lexicographic Synonymy Property Dictionary; RTC: a set of (rough) type conflicts between entity attributes and entities): return a set of type conflicts between entity attributes and entities; Function
VZll-
A, Q: a set of reals in the interval
[0,11;
h, k : 1.n; TTC:
a set of type conflicts
between
entity
attributes
and entities;
begin
TTC := 8; for h := 1 to n do for k := 1 to n such that h # k do for each pair consisting
of an entity
attribute
Ai E S,, and an entity
Ej E Sk
do begin
AA~E~ := EA_E_Att_Syn(Ai, QA;E, := EAEEnt_Syn(Ai,
Ej, SPD);
Let [Ai, Ej, ~A,E, 1 be a tuple
belonging
TTC := TTCU end; return end
Ej, LSPD);
TTC
to RTC;
[Ai, Ej, EA_ERefined_Val(AAiEj,Q~,Ej,
fA,E,)j
A Uniform Methodology for Extracting Type Conflicts and Subscheme Similarities
533
Function EAEAtt-Syn The function EA_E_Att_Syn computes synonymies between attributes belonging to the context of an attribute Ai and the context of an entity Ej, which it receives in input. This function is as follows: Function EA_E_Att_Syn(Ai: an entity graphic Synonymy Property Dictionary):
attribute; Ej: an entity; return a real E [O, 11;
LSPD:
a Lexico-
const w, = 1; var SP: a set of attribute
synonymies;
L, M : a set of attributes; F: a matrix
of reals;
begin SP := 0; L := { Al 1 Al is an attribute M := { A,
1A,
and Al E AContezt(Ai)
and A,,, E ECmtext(Ej)
is an attribute
Ai E L and A,
for each pair of attributes SP := SP U (AI, A,,
};
W_Mean(AI,
};
E M do
A,, LSPD));
SP := Obj_Discard(SP); F := Mat(L, M, SP); return Matching(L,
M, F, u,)
end The function W-Mean (already introduced in [22]) computes the synonymy coefficient associated to a pair of attributes (Al, A,) taken in input. To this end, three characteristics are considered, namely, attribute name, domain, and key characterization; the key characterization of an attribute tells if the attribute is a primary key (PK), a secondary key (SK) or not a key (NK) for the entity it belongs to. The function returns the result of the following expression: WN
x NA,A,
+
WD
x DA,A,
+
WK
x KA~A,,,
where WN, WD and WK are weighting factors, whose optimal values have been experimentally set to O&O.15 and 0.05, respectively; NA,A, is the value of the lexicographic synonymy between Ai LSPD; the value of DA,A, is set to 1 if the domains of and Am, as resulting from the dictionary AI and A, are the same, to 0.5 if they are compatible (default compatibilities are integers with reals and chars with strings; further, more specific, compatibilities can be provided by the human expert) and to 0 otherwise. Finally, K,+A, is set to 1 if Al and A, are both PK or both SK or both NK, to 0.5 if one of them is PK and the other is SK or if one of them is SK and the other is NK, to 0 if one of them is PK and the other is NK. The function Obj_Discard considers each tuple belonging to SP and normalizes the corresponding plausibility value, by setting it to 0 if this was previously under a certain given threshold th,. The optimal value of th, has been experimentally set to 0.25 (see the Introduction and Section 4 for the description of these experiments). The function Mat creates a matrix F having a row for each Al E L, a column for each A, E M and F[Al, A,] equal to the plausibility value associated in SP to the pair Al and A,. The function Matching computes a factor obtained from calculating a maximum weight matching. The input here are two sets of objects L = (11,. . . ,l,} and M = {ml,. . . , m,} and a cost matrix F on L and M such that, for each la E L and mj E M, 0.0 5 F[l,,mj] 5 1.0. The output is a value v in the real interval [0,11. If L = 0 or M = 0 then Matching returns TV= 0. Otherwise, let BG = (L U M, A) be a bipartite weighted graph, where A is the set of weighted edges
534
GIORCIOTERRACINAand DOMENICOURSINO
{ (Zi, mj, fij) ) fij > 0); the maximum weight matching for BG is a set A’ C A of edges such that for each node x E LUM there is at most one edge of A’ incident onto x and d(A’) = &;,mj,fijJEA, fij Now, let s(A’)
is maximum+.
= $$.
Th e value TJreturned
by Matching is defined
abs(lLI-IM1)+2x(min(II;I,JM()-IA’lr lLl+lMI
where w, is a constant has been experimentally Function
used to weight the importance set to 1.
of unrelated
>
as:
x ;i;W
edges.
The optimal
value of wV
EA_EXnt_Syn
The function EA_E_Ent_Syn takes in input an attribute Ai, an entity Ej and a dictionary SPD of synonymy properties and returns a synonymy coeflicient as the maximum of plausibility coefficients relative to all entities related to the entity Ej and the entity which Ai belongs to. More specifically the function is the following: Function
Property
EA_E_Ent_Syn(Ai: an entity attribute; Dictionary): return a real E [0,11;
Ej: an entity,
SPD:
a Synonymy
var
SP: a set of entity
synonymies;
begin SP := 0; Let El be the entity for each entity
in the AContext(
E,,, E EContext(Ej)
if LEl, Em, ~E[E,J E SPD
do
then
SP := SP u (El, Em, LYE,) else
SP := SP U (El, E,, 0); return
Max(SP);
end
The function Max takes in input the set SP of synonymies the maximum of all plausibility values. Function
between
objects
and yields in output
EAERefined_Val
The function EA_E_Refined_Val takes in input the real numbers AA;E,, QA;E, and f&E, and computes their weighted mean with the coefficients VA = 0.3, VQ = 0.25 and Vf = 0.45. 2.3.3.
Discarding Weak Type Conflicts
EA_E_Discard_ Weak takes in input a set TTC of (candidate) type conflicts bety;;oentity attributes and entities. It computes an interest threshold: th,, = max (6, TTCMAx MuIL), where TTCMAX and TTCM~N are the maximum and the minimum plausibility val& in TTC and 6 is a real number E [O, l] whose value has been experimentally set to 0.6. The function then discards all type conflicts whose coefficient is smaller than th,,. As already stated, all type conflicts not discarded by EA_E_Discard_ Weak are added to the Type Conflict Dictionary TCD.
A Uniform Methodology
for Extracting
Type Conflicts and Subscheme Similarities
535
33
(1.1)
Idmtitia
Duration
warranty
Fig. 1: Scheme AD: the Administration
2.3.4.
Department
Database
An Example
Consider the schemes in Figures 2 and 1 representing, respectively, Administration AD) and Production (denoted PD) departments of an organization. Assume the following tuples belong to LSPD: [Id, SSN, 0.81 [BirthDate, Date, 0.91 [Name, Name, 11 [Sala y, Stipend, 0.91 [Birthplace, District, 0.451 (Reference-Area, Reference-Section,
(denoted
[Id, Name, 0.1 [Name, SSN, 0.1 [Home, Address, 0.91 [Birthplace, Zip Code, 0.4 [BirthPlace, Town, 0.81 0.81
We show how the function EAE_Conf works for schemes given in the example. Due to space limitations we consider only the pair consisting of the attribute Birthplace of scheme AD and the entity Town of scheme PD. The function EA_ERough inserts the tuple [Birthplace,
Town, LexSyn(BirthPlace,
Tmn,
LSPD)
= 0.81
into RTC. Consider now EAERefined. It begins with calling EAEAtt_Syn. When this is activated it constructs: L := {Id, BirthDate, Name, Home, Salary} M := {SSN, Name, Address, Function, Stipend, Date} For each pair of attributes Al E L and A, E M, their plausibility coefficient is derived by applying the function W-Mean and the corresponding tuple is inserted into SP. Tuples constituting SP are. therefore: tA survey on maximum weight matching can be found in [13].
GIORGIO TERRACINAand DOMENICOURSINO
536
Cod.5
Zip-Code NUIK Rcsident_Number District
NZUM
Reference_Area Occupied_Suface
Fig. 2: Scheme PD: the Production
(Id, SSN, 0.84) (Id, Function, 0.15) (Birthdate, SSN, 0) (BirthDate, Function, 0.05) (Name, SSN, 0.74) (Name, Function, 0.18) (Home, SSN, 0.15) (Home, Function, 0.2) (Salary, SSN, 0) (Salary, Function, 0.05)
Department
(Id, Name, 0.74) (Id, Stipend, 0) (BirthDate, Name, 0.03) (BirthDate, Stipend, 0.05) (Name, Name, 1) (Name, Stipend, 0.03) (Home, Name, 0.18) {Home, Stipend, 0.05) (Salary, Name, 0.03) (Salary, Stipend, 0.92)
Database
(Id, Address, 0.15) (Id, Date, 0) (BirthDate, Address, 0.05) (BirthDate, Date, 0.92) (Name, Address, 0.18) (Name, Date, 0.03) (Home, Address, 0.92) (Home, Date, 0.05) (Salary, Address, 0.05) (Salary, Date, 0.05)
The function Obj-Discard considers each tuple in SP; for those having less than 0.25 it modifies the plausibility coefficient setting it to 0. The function Mat yields in output the following matrix:
F=
i
0.84 0 0.74 0 0
0.74 0 1 0 0
0 0 000 0 0 0.92 0 0 0
0 0 0 0.92
= w
= y
= 0.92
coefficient
0 0.92 0 0 0
The function Matching computes the maximum weight matching: A’ = {(Id, SSN, 0.84), (BirthDate, Date, 0.92), (Name, Name, (Salary, Stipend, 0.92)) $(A’)
a plausibility
l), (Home,
Address,
0.92),
v = (1 - $ x i”i) x 0.92 = 0.88
Then, the function EA-E-Ent-Syn is activated. The entity belonging is Employee; the entity belonging to EContext(Town) is Subordinate.
to ilContext(BirthPlace) Any algorithm computing
A Uniform Methodology for Extracting Type Conflicts and Subscheme Similarities
537
entity synonymies finds a synonymy to hold between these entities. For example the algorithm in [22] returns a synonymy with a plausibility coefficient equal to 0.57; suppose this synonymy is present, in SPD; when the function EA_E_EntSyn is called, (Subordinate, Employee,0.57) is stored in SP; consequently Max(SP) = 0.57 and this value is returned by the function. Therefore EA_ERefined_Val returns 0.3 x 0.88 + 0.25 x 0.57 + 0.45 x 0.8 = 0.77 and the tuple [Birthplace, Town, 0.771 is stored in the TTC. The function EA_EDiscard_Weak considers this property valid and stores the corresponding tuple in the Type Conflict Dictionary TCD. 2.4. Discovering
Conflicts between Entities and Relationships
The function ER_Conf be encoded as follows:
derives conflicts between entities and relationships. The function can
Function E_R_Conf (S: a list of n database schemes; SPD: a Synonymy Property Dictionary; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of type conflicts between entities and relationships; VaF
RTC:
a set of (rough) type conflicts between entities and relationships;
TTC:
a set, of type conflicts between entities and relationships;
begin RTC := ERRough(S,
LSPD);
TTC := E_RRe f ined(S, SPD, RTC); TTC := E_RDiscaTd_Weak(TTC); return TTC end The three procedures called by ER_Conf 24.1.
Computing Rough Entity-Relationship
are illustrated in the following subsections. Conflicts
The function E-R-Rough proceedes by analysing pairs consisting of an entity and a relationship, belonging to different schemes and, for each pair, computes a coefficient stating the synonymy degree of the involved objects. To this purpose, it considers all possible pairs of attributes, one belonging to the entity structure and the other to the relationship structure and, for each pair, computes the associated similarity degree. The function is encoded as follows: Function E_R_Rough(S: a list of n database schemes; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of (rough) type conflicts between entities and relationships; const
SP: a set of synonymies between attributes; RTC:
a set of type conflicts between entities and relationships;
h, k : 1 . . n; L, M: a set of attributes; F: a matrix of reals; begin
GIORGIOTERRAGINAand DOMENICOURSINO
538
RTC := 0; for h:= 1 to n do for k:= 1 to n such that h # k do for each pair consisting
of an entity
Ei E Sh and a relationship
Rj E Sk do
begin SP :=
0;
L := { AI 1AI is an attribute M := { A,
1A,
and Ai E EStructwe(Ei)
is an attribute
for each pair of attributes SP := SP U (Al, A,,
and A,
Al E L and A, W_Mean(Al,
E RStructwe(Rj)
}; };
E M do
A,, LSPD,w,));
SP := ObjDiscard(SP); SP := Att_Relevance(SP); F := Mat(L, M, SP); RTC I= RTC U [Ei, Ri, Matching(L,
M, F, w,)]
end; RTC := Normalize(RTC); return RTC end Functions W-Mean, Mat and Matching are the same as those in EA_E_Att_Syn. The function Att_Relevance takes in input a set SP of synonymies between attributes and performs a normalization of corresponding plausibility coefficients taking into account the semantic relevance of the involved attributes, i.e., their relevance in distinguishing the semantics of objects they belong tot [12]. The leading idea is that, in the definition of the object semantics, primary key attributes are less influencing than candidate non-primary key attributes that are in turn less influencing than non-key attributes. More in particular, for each tuple (AI, A,, ~,J,A,,,) E SP, the function multiplies ~A,A,,, by a suitable coefficient CA,A, which takes into account the semantic relevance of the attribute pair. Values for CA,A, have been obtained experimentally; in particular CA~A, is equal to: (i) 0.4, if both Al and A, are primary keys; (ii) 0.6, if an attribute is a primary key and the other is a non-primary key; (iii) 0.65, if an attribute is a primary key and the other is not a key; (iv) 0.8, if both Al and A, are non-primary keys; (v) 0.9, if an attribute is a non-primary key and the other is not a key; (vi) 1, if both Al and A, are not keys. The function Normalize takes in input a set RTC of type conflicts between entities and relationships and performs a final normalization on the set of coefficients provided as its input. Indeed, because of the computation performed by the function Att_Relevance, the synonymy coefficients we have computed for type conflicts are underestimated. Therefore the function Normalize sets the plausibility coefficient f of each entry of RTC to N x f, where N is a suitable normalization coefficient. Due to space limitations we cannot describe here how N is obtained.
24.2. Computing Refined Entity Relationship Conflicts This function is devoted to derive more refined entity-relationship the algorithm exploits both previously derived rough type conflicts texts. The function is as follows:
type conflicts. To this end, and synonymies between con-
Function E-R-Refined (S: a list of n database schemes; SPD: a Synonymy Property Dictionary; RTC: a set of (rough) type conflicts between entities and relationships): return a set of type conflicts between entities and relationships; tFor example, as said in the Introduction, the attribute distinguishing the entity Employee from the entity Car.
Salary is more relevant than the attribute
Identifier
for
A Uniform Methodology
for Extracting
Fig. 3: Two Different Representations
A,Q: a set of reals in the interval
539
Type Conflicts and Subscheme Similarities
of Marriages
[0, 11;
h,k : l..n; TTC:
a set of type conflicts
between
entities
and relationships;
begin TTC
:= 0;
for h:= 1 to n do for k:= 1 to n such that h # k do for each pair consisting
of an entity
Rj E Sk do
h’i E Sh and a relationship
begin QE;R~ := E_R_Ent_Syn(Ei,
Rj, SPD);
Let [Ei, Rj, fEiRj 1 be a tuple TTC
belonging
to RTC;
:= TTC U [Ei, Rj, E_R_Refined_Val(fE;R,,
QE;R~)~
end; return TTC end The function being that: -L:={El(E’
EREnt-Syn I 1s an entity
- M := { E,,, 1E, The function
to the function
and El E EContext(Ei)
is an entity
E-R-Refined-Val
2.4.3. Discarding
is analogous
and E,
is analogous
EAE-Att-Syn,
the only differences
};
E RCmtext(Rj)
}
to EA_E_Refined_Val
but now Vf = 0.45 and VQ = 0.55.
Weak Type Conflicts
The function E-R-Discard- Weak takes in input a set TTC of type conflicts between relationships; it is analogous to EA_E_Discard_ Weak; in particular the threshold th,, set to 0.6. to max (y, TTCMax~TTC~rN ) , where y h as b een experimentally
entities and is set equal
2.4.4. An Example As an illustrating example of our methodology, consider taken from [24]): Assume that the following tuples belong to LSPD: [Date, Date, l]
the schemes shown in Figure 3 (example
[Date, BirthDate,
0.91
We show how the function E_R_Conf works for schemes given in the example. We consider only the entity EMarriage and the relationship RMarriage. Consider the function E-R-Rough. Here: M := {Date, Name, BirthDate} L := {Date, Location, Contract} For each pair of attributes Al E L and A, E M the plausibility coefficient is derived by applying the function W-Mean, and the corresponding tuple is inserted into SP. Tuples included in SP are, therefore:
GIORGIO TERRACINA and DOMENICO URSINO
540
(Date, Date, 1) (Location, (Contract,
The application
Date, 0.05) Date, 0.05)
of functions
(Date,
(Location, (Contract, Obj-Discard,
Att_Relevance
1 ( The activation
of the function
Matching
(Location, (Contract,
Name, 0.09) Name, 0)
F=
[EMarriage,
(Date, BirthDate,
0)
Name,
0
0.05) 0.05)
and Mat yields in output:
0.92
00
0
00
0
returns
0.92)
BirthDate, BirthDate,
1
(1 - 0.5 x f) x 1 = 0.67 and the tuple
RMarriage,
0.671 is stored in the RTC. Next we apply the function E-R-Refined on EMarriage and RMarriage. the function E_R_Ent_Syn is invoked; the function determines:
At the beginning,
121 := {Person}
L := {Person}
The application of the procedure for detecting entity synonymies described in [22] yields in output [Person, Person, O.SS]; therefore the function returns QE,unrringe Rh,arringe = 0.66; now the function E-RRefined-Val is activated which returns 0.45 x 0.67+0.55 x 0.66 = 0.66; therefore the tuple [EMarriage, RMarriage, 0.661 is stored in the TTC. Finally the function E-R-Discard- Weak judges that there is a conflict between Ehlarriage and RMarriage; therefore the corresponding tuple is returned as a result by E-R-Conf and is added to the Type Conflict Dictionary. 2.5.
Discovering
Other
Type Conflicts
Discovering
conflicts between relationship attributes and entit,ies is carried out by the function this is very similar to EA-E-Conf, the only relevant difference is that it considers the structure and the context of a relationship attribute instead of the structure and the context of an entity attribute. Discovering conflicts between entity attributes and relationships is carried out by the function EA-R-Conf; this is analogous to EA-E-Conf, except that it considers relat,ionship structure and context instead of entity structure and context. Discovering conflicts between relationship attributes and relationships is carried out by the function RA-R-Conf; this is analogous to RA_E_Conf, the only relevant difference is that it considers the structure and the context, of the relationship instead of the structure and t~he context of an entity. RA-E-Conf;
3. EXTRACTION 3.1.
OF SIMILARITIES
AMONG
OBJECT
CLUSTERS
Preliminaries
In the previous sections we have shown algorithms for recognizing type conflicts between objects. However there are many situations where a certain portion of the world representation is incorporated as two different sets of objects (hereafter called object clusters) within two schemes. In order to single out such similarities we have designed a second group of algorithms which are presented in this section. Preliminarly consider the following definitions: Definition 2 An object cluster is either (i) a relationship cluster. or (ii) a subset cluster, or (iii) a composed cluster. A relationship cluster is a portion of an E/R scheme consisting of a relationship, the entities it links and all involved attributes. A subset cluster is a portion of an E/R scheme consisting of an entity. the entities linked to it by an is-a relationship and all involved attributes. A composed cluster is a cluster including in turn two object clusters sharing at least, one entity.
A Uniform Methodology
for Extracting
Type Conflicts and Subscheme Similarities
541
Definition 3 Given an object cluster C, the structure of C consists of objects (attributes, entities and relationships) belonging to C; the context of C consists of the difference between the union of the cluster’s object contexts and objects belonging to the cluster itself (and therefore belonging to the cluster’s structure). In the sequel, we will use functions CStructure(C) and CContext(C), which take in input a cluster C and return the objects belonging to its structure and to its context, resp. For each cluster, our algorithm takes into account both its structure and its context. The algorithm exploits a Dictionary of Lexicographic Synonymy Properties LSPD and a Dictionary of Synonymy Properties SPD (see Section 2.2 for details about these dictionaries). It can be encoded as follows: Algorithm
for extracting object cluster similarities
Input: a list 5’ of n database schemes; a dictionary SPD dictionary LSPD of lexicographic synonymy properties;
of synonymy
Output:
clusters;
OCSD
a dictionary
of similarities
between
object
properties;
a
var ROCS:
a set of (rough)
OCPS:
a set of (interesting)
similarities
between
pairs of object
object
clusters;
clusters;
begin OCPS
:= Select-Interesting_Clusters(S,
ROCS
:= OC_Rough(S,
OCSD
:= OC-Ref
OCSD
:= OC_Discard_Weak(OCSD);
OCPS,
ined(S,
SPD,
OCPS,
SPD); LSPD);
SPD,
ROCS);
return OCSD end In the following 3.2.
Selecting
sections
Interesting
we will describe
into detail the functions
used in the algorithm.
Clusters
The function for selecting interesting clusters is devoted to limit the (potentially exponential) number of possible clusters by determining the most interesting, i.e., the most promising ones. An object cluster can be considered as the minimum subscheme containing two “seed” entities, called seed-entl and seed-ent2. The term “minimum subscheme” indicates the subscheme containing the minimum number of objects, among those comprising both seed-entl and seed_entZ If more than one subscheme exists with the same (minimum) number of objects, that having the maximum number of objects involved in synonymies stored in the SPD is chosen. If more than one exists with this characteristic, that having the greatest plausibility coefficients, associated to the synonymies which the objects belonging to it are involved in, is taken. For each pair of schemes Sh and Sk belonging to S, the function determines all pairs of object clusters (Ci, Cj) such that Ci E Sh, Cj E Sk and both Ci and Cj have been judged promising. More in particular, in a first step, object clusters of Sh are those having, as seed_entl, an entity E, involved in a significant entity synonymy and, as seed_ent2, any other entity El of Sh; given an object cluster in Sh having E, and El as the seeds, corresponding object clusters in Sk are those having, as seed_entl, any entity E, related to E, by a significant synonymy and, as seed-entb, any entity E, related to El by a significant synonymy. The second step is dual to the first one. The function can be encoded as follows: Function Select_Interesting_CIusters(S: a list of n database schemes; SPD: a Synonymy Property Dictionary): return a set of (interesting) pairs of object clusters;
GIORGIO TERRACINA and DOMENICO URSINO
542
var
OCPS:
a set of (interesting)
pairs of object
clusters;
begin for each [IS,., E,, ~E,E,J E SPD do begin
Let Sh be the scheme of E, and let Sk be the scheme of E,; for each entity
El E Sh do
for each [El, E,,
~E,E,,,J E SPD such that E,
C, := Minimum_Cluster(E,,
El);
Cj := Mininl7Lm_Cluster(E,,
E,,);
OCPS
E Sk do begin
Z= OCPS U (Ci, Cj)
end; for each entity
Ep E Sk do
for each [Ep, E,, f E,E,J E SPD such that Eq E Sh do begin Ci := Minimum_Cluster(E,,
Eq);
Cj := Minimum_Cluster(E~,
Ep);
OCPS
1= OCPS U (Ci, Cj)
end end end The function Minimum_Cluster takes in input two seed entities Ei and Ej and yields in output the object cluster (i.e., the “minimum subscheme”) which both E, and E:, belong to.
3.3. Extracting Rough Synonymies between Object Clusters The function for computing rough object cluster similarities determines, for each pair of object clusters, a coefficient stating the associated similarity degree; this is a weighted mean of similarity degrees of attributes, entities, relationships and cardinalities relative to the clusters of the pair. The function is as follows: Function OC_Rough(S: a list of n database schemes; OCPS: a set of (interest.ing) object cluster pairs; SPD: a Synonymy Property Dictionary; LSPD: a Lexicographic Synonymy Property Dictionary): return a set of (rough) object cluster similarities; var Q, T, U, V: a set of reals E [0,11;
TOGS:
a set of similarities
between
object
clusters;
begin
TOCS := 0; for each pair of object
clusters
(Ci, Cj) E OCPS
Qc,cj
Z= OC_RoughAtt_Syn(Ci,Cj,
Tc,cj
:= OC_Rough-Ent_Syn(Ci,
Cj, SPD);
uc,c,
:= OC_RoughRel_Syn(C,,
C,, SPD);
Vc;c,
:= OC_Rough_Card_Syn(S,
Ci, Cj);
do begin
LSPD);
TOCS := TOGS U IjCi, Ci, OC_Rough_Val(Qc,c,, end; return end
TOCS
Tc,c,, iJc,c,, I’c,c,)II
A Uniform
Methodology
for Extracting
Type
Conflicts
Functions OCRougkAttSyn, OC_Rough_Ent_Syn and function EA_E_AttSyn, the only differences being that: l
- L := {AI ( AI is an attribute 1A,
OC_Rough_RelSyn
are analogous
543
to the
and Ai E CStructure(Ci)}
is an attribute
and A,,, E CStructure(Cj)J
in OC_Rough_Ent_Syn: - L := {El 1El is an entity - M := {E, - SP:=U
l
Similarities
in OC-Rough_Att_Syn:
- M := {A, l
and Subscheme
1E,
and El E CStructure(Ci)} and E,,, E CStructure(Cj)}
is an entity
LEI,E,.IE,E,JESPD.E,EL.E,EM
in OC-Rough-Rel-Syn: - L := {RI 1RI is a relationship - M := {R, - SP:=U
j R,
and Rl E CStructure(Ci)} and R,
is a relationship
(fb,%n,
LR,.Rm.fR,RmlESPD,R,~L.Rm~M
E CStructure(Cj)} f&R,)
The function OC_Rough_Card_Syn takes in input a list S of schemes and two object clusters Ci and Cj; it considers cardinalities of relationships constituting the structure of clusters and returns a real value E [0,11;the higher the number of relationships of Ci having the same cardinalities of the corresponding synonym relationships of Cj is, the higher the value returned in output is. A simple implementation of the function should return the ratio between the number of synonym relationships having the same cardinality and the total number of synonym relationships. The function OC-Rough-Val takes in input four real coefficients: Qcicj, Tcicj, Ucicj and Vcicj and computes their weighted mean with the coefficients ZQ = 0.5, ZT = 0.3, ZU = 0.15 and 2” = 0.05. 3.4.
Extracting Refined Similarities between Object Clusters
The function for computing refined object cluster similarities is devoted to derive more refined cluster similarities. To this purpose it exploits entity and relationship synonymies, stored in the SPD, and rough object cluster similarities, stored in the ROCS. In more detail the function is as follows: OC_Refined(S: a list of n database schemes; OCPS: a set of (interesting) cluster pairs; SPD: a Synonymy Property Dictionary; ROCS: a set of (rough) cluster similarities): return a set of object cluster similarities;
Function
object object var
T, U: a set of reals E [0,11; TOCS:
a set of similarities
between
object
clusters;
begin
TOCS := 0; for each pair of object
Tcicj
clusters
(Ci, Cj) E OCPS
:= OC_RefinedEnt_Syn(Ci,
Ucicj := OC_RefinedRelSyn(Ci,
do begin
Cj, SPD); Cj, SPD);
Let ]Ci, Cj, fc;c, J be a tuple belonging
to ROCS;
TOCS := TOCS U IlCi,Cj, OC_Refined_VaZ(fc;cj, end; return end
TOCS
Tcicj, Uc;c,)ll
GIORGIO TERRACINA and DOMENICO URSINO
544
The function OC-Refined-Ent-Syn ferences being that: - L := {El 1E 1 1s an entity - M := {E, The function
1E,
is analogous
OC_Rough_Ent_Syn,
to the function
the only dif-
and El E CContext(Ci)} and E,
is an entity
OC-Refined-Rel-Syn
E CContext(C~)}
is analogous
OC_Rough_Rel_Syn,
to the function
the only differ-
ences being that: -
L := {Rl 1Rl is a relationship
- M := {R,
1R,
and RI E CContext(Ci)} and R,
is a relationship
E CContext(C,)}
The function OC-Rejined_Val takes in input three parameters fc,cj, Tc,c, and Uc>cj and computes their weighted mean with the coefficients Zf = 0.65, 2~ = 0.25 and 2~ = 0.1. 3.5.
Discarding
Weak Object
Cluster Sim,ilarities
The function OC_Discard_ Weak takes in input a dictionary OCSD of similarities between object clusters and discards the weak ones; the function is analogous to EA-E-DiscardWeak except that now the threshold th,, is equal to max (CT,OcsDMAx+ocsD~llN ), and CJ has been experimentally set to 0.55. Properties judged interesting by OC_&iscard_ Weak are stored in the final Object Cluster Similarity Dictionary. Some Examples
3.6. 3.6.1.
A First Example
Consider the schemes and the LSPD of Section 2.3.4. Let us apply the algorithm for computing the similarity degree associated to the pair of object clusters Cl = (Employee - Works-Division) and C, = (Subordinate - Operates - Department). At first, the function OC-Rough is called; it activates OC_Rough_Att_Syn which Qclcz = 0.81. After this, functions OC_Rouglb_Ent_Syn, OC_RoughRelSyn and OC_Rough_Card_Syn nally, OC-Rough-V&
returns
are invoked which return Tc,c, = 0.66, Uc1c2 = 0.57 and I/c1c2 = 1. Fireturns 0.5 x 0.81 + 0.3 x 0.66 + 0.15 x 0.57 + 0.05 = 0.75 and the tuple
/lCl, Cz, 0.7511 is inserted in the ROCS. After this, OC-Refined is called; it activates OC_RefinedEnt_Syn, which returns TcIcz = 0.25, and OC-RefinedRel_Syn, which returns Uclcz = 0.4; then, the function OC_Refined_Val is activated which returns 0.59; the tuple l/Cl, C2,0.5911 is added to the OCSD. Finally, the function 3.6.2.
OC-Discard-Weak
judges
the similarity
meaningful.
A Second Example
Consider
the schemes
in Figure
4, taken
from [24].
Ordered - Product) and Cz = (Customer - Places Assume that the LSPD stores the following tuples: [Nwme, Name, By applying the SPD:
the approach [Customer, [Ordered,
l]
[Date, ODate,
presented
Customer, Ordline,
0.851
[Quantity,
in [22], the following
0.72J
0.601
Consider
- Order
the cluster
- Ordline
Qty, 0.951
synonymies
Cl = (Customer
-
- Product).
[P#, P#,
l]
can be found and stored in
[Product,
Product,
[Ordered,
Places,
0.751 0.471
At the beginning, the funct,ion OC-Rough is activated. It, in its turn, calls OC_Rough.Att_Syn, which yields 0.86, OC_RoughEnt_Syn, which returns 0.66, OC-RoughRelSyn, which returns 0.40, and OC_Rough_Card_Syn, which yields 1. Finally, the function OC_Rough_Val returns 0.5 x 0.86 + 0.3 x 0.66 + 0.15 x 0.40 + 0.05 x 1 = 0.74 and the t,uplc Ill:, , Cz. 0.7411 is stored in ROCS. At this point, OCRefined should nevert,heless, extreme cases are:
be called.
In this example
no context
has been specified;
A Uniform Methodology
for Extracting
Type Conflicts and Subscheme Similarities
I
Name-
.
545
--op#
Tc,c, = 0 and UcIcz = 0, in which OC_Refined_Val returns 0.65 x 0.74 + 0 + 0 = 0.48; consequently the OCDiscard_Weak discards the corresponding similarity;
function
case the function
. the two contexts OC_Ref
are very similar so that Tc,c, = 1 and Uclca = 1, in which case the function returns 0.65 x 0.74 + 0.25 x 1+ 0.1 x 1 = 0.83; in this case the object cluster is accepted and stored in the corresponding dictionary by OC_Discard_Weak.
ined_Val
similarity
is able to automatically discover very difficult cluster This example shows that our approach _ _ similarities 1241. Moreover the example shows how our algorithm is sensible to the context (and therefore to the scheme semantics [B]); therefore its choices are not only syntactic but also semantic (see [6, 8, 221). 4. A REAL
CASE EXAMPLE
This section is devoted to show how our approach scales to real example cases. In particular we illustrate here one of the experiments we have carried out on ICGO databases. Database schemes of ICGO have been the main test set that we have used to verify the effectiveness of our techniques in deriving interscheme properties. This set includes heterogeneous databases, both in the data model and languages (e.g., hierarchical, network, relational), and as far as their structure and complexity are concerned (ranging from simple databases with schemes including few objects, to very complex databases). The ICGO databases into consideration are: Banca Dati Legge 185 - Credit0 Agevolato per piccole e medie imprese (Database of Facilitated Credit for small and medium firms), hereafter FCD. The scheme of this database is shown in Figure 5. Banca
Dati Legge 317 - Contributo
of Sunk database
Contributions for small is shown in Figure 6.
a Fondo
and medium
Banca Dati Legge 46 - Richieste Contributions), hereafter FRCD.
Albo delle Imprese
Figure
9.
(Firm Database),
per piccole
firms),
hereafter
e medie imprese SCD.
(Database The scheme of this
di imprese
per contributi (Database of Firm Requests The scheme of this database is shown in Figure 7.
Banca Dati Obiettivi CEE - Contributi
of European Union Contributions is shown in Figure 8.
Perduto
for
Unione Europea per Piccole e Medie Imprese
for small and medium
hereafter
FD.
firms).
(Database The scheme of this database
The scheme of this database
is shown in
GIORGIO TERRACINA and DOMENICO URSINO
546
Fig. 5: Database
of Facilitated
Credit
and Medium
Firms
(FCD)
I
I
Fig. 6: Database
for Small
of Sunk
Contribution
for Small
and Medium
Firms
(SCU)
A Uniform
Methodology
for Extracting
Fig. 7: Database
Fig. 8: Database
of European
of Firms
Type Conflicts
Requests
Union Contributions
and Subscheme
for Contributions
Similarities
(FRCD)
for Small and Medium Firms
(EUCD)
547
and DOMENICO
GIORGIO TERRACINA
548
URSINO
I Teehnicrl Manager
Historical Register
Company
Fig.
First
Obpxt
Producing Factory
9: Firm
Database
Second ActivityIF,,]
[FCD]
Contribution[Fc,pl Machine[FcDl FirmwDl Credit InstitutqFcDl Factorypcq FirmlFcDl Producing ActivityfFc,l Factorypcq Firm[FcD] Firmi.-cnl Factory[scq FiqscDl Producing Activityls,,l Factory,scD1 Firqscr,l FirqscDl Factory[FRcDl FirowRcDl Progra1q,,cLq FiroqFRCD] Firm[Eucnl Table
1: Synonymies
(FD)
Coefficient
Object
Producing Activity[sc,i Factory[scDl Contributionfscol Machinqscol Firm[scp Credit InstitutqscDl Factory[FRcol FirmpRcDl Producing Activity[El,CD] Factowpucq Firm[EucDl
ComPanY[F,] FactorYpRcD] Firm[FRcq
Producing ActivityLEucD] FactorylEucDl Firmwucol CompanylFDl FactorylEucDl Firm(Eucul Program(8acol ComPanYIFD] ComwnYfFDl
Derived
for ICGO
0.98 0.94 0.90
0.87 0.85 0.80 0.80 0.72 0.94 0.93 0.86 0.54 0.78 0.66 0.92 0.90 0.76 0.52 0.80 0.67 0.59 0.47 0.54
Schemes
A Uniform Methodology
y irst Object
Cluster
( Producing
Activity
ff
for Extracting
Carries Out c)
1 ( Producing
A Factory tf Carries Out tt et Prod&g Activity )~FCO~ ( Machine ++ For t-+ Factory t) et Carries Out ff Producing Activity )rpc~, ( Contribution ++ Asks For et Factory )lFc~l ( Contribution t) Asks For ct Factory ct ct Carries Out ct Producing Activity fs t-+ Machine t) For )IFCD, _ ( Machine tt For fs Factory +t tt Carries Out ct Producing Activity tf c) For et Contribution )lFcDl ( Contribution t) Asks For t) A Factory ts Carries Out tt et Producing Activity tt Machine tt
* For )[FCDJ
+t Carries Out ct
I ( Contribution 1 4-kF act orv t)
++ Asks For c1 Carries Out et ++ Prodoling Activity )ISCD~ ( Machine ct For c) Factory c) tf Carries Out ++ Producing Activity )I~c~, ( Contribution ++ Asks For t+ Factory ),sco, ( Machine e, For +t Factory ++ t, Carries Out et Producing Activity et cs For t) Contribution ),SCD, ( Contribution +t Asks For ++ Factory tt cs Machine +-+ For tt t) Carries Out t+ Producing Activity ),SCD, ( Contribution c1 Asks For e, & Factory CI Carries Out c1 ct Producing Activity t+ Machine tt
I 1 * For )[SCD~ I ( Contribution cf
Asks For e, t3 Factory ct Machine tt For cf ct Carries Out ++ Producing Activity
t)
++For )~CDI
( Contribution tt Asks For H t) Factory c) Machine t, For t) ct Carries Out tf Producing Activity ( Machine fs For tf Factory ++ tf Carries Out tt Producing Activity ff For tt Contribution )fFC~, . , ( Contribution tt Asks For c) tt Factory c1 Machine ++ For tf t) Carries Out C) Producing Activity ( Machine c) For fs Factory )LFCDI
1 0.93 I 0.89
I 0.87 0.86 0.85
0.85
0.85
I I
( Contribution c) Asks For ++ tf Factory fs Carries Out ++ t) Producing Activity tt Machine
fs
0.85 ct 0.85 tt
++For )ISCDI I ( Machine +) For ),FCDJ
I
tf Factorv c) ++ Carries Out tf Producing Activity tf tt For N Contribution )lscD, ( Producing Activity ct Carries Out cf tf Factory )~SCD~ Contribution tf Asks For ct tf Factory c) Machine ++ tt For tf Carries Out t) c) Producing Activity )[SCDI l Machine c) For +-t Factorv tt tt Carries Out tf Producing Activity f) tf For tf Contribution )[~col ( Factory tt Has tf Firm )[FRCD, ( Producing Activity e, Carries Out +t ++ Factory )IEUCDI ( Producing Activity ct Carries Out tf t-i Factory cf Has t-t
( Factory tt Has tt Firm )IFRCDI ( Producing Activity t) Carries Out ft * Factory )[EUCOI ( Factory t) Has c) Firm )[E~CDI Cluster
Similarities
Derived
for ICGO
0.85
)[SCD~
i+ For )ISCD~
( Firm tt Asks For ++ Factory )ISCD~ ( Producing Activity f) Carries Out tt ct Factory )lscD1 ( Factory c) Has tt Firm )[FRCD, 2: Object
( Contribution t) Asks For er tt Factory tt Carries Out tt t) Producing Activity tt Machine
)L~CDI
( Contribution C) Asks For C) c) Factory ++ Machine et cs For c* Carries Out c, et Producing Activity )[FCD) l Machine tf For tt Factorv ++ cf Carries Out +-+ Producing Activity cf ct For ct Contribution )~FCDI ( Factory f+ Asks For e, Firm )[FCD, ( Producing Activity t) Carries Out c) et Factory )IFCDI ( Producing Activity tt Carries Out c) et Factorv tt Asks For tt
Table
Activity
549
tt Factory ),scD,
cf Factory ),pcDl ( Contribution tt Asks For ct
( Contribution tt Asks For t) A Factory ++ Carries Out c) tf Producing Activity tt Machine
Type Conflicts and Subscheme Similarities
Schemes
I I
0.85
0.85 0.85
I I
0.85
0.80 0.90 0.85
0.66 0.88 0.76
GIORGIO TERRACINA and DOMENICO URSINO
550
By applying
the approach
presented
in [22], we derive the synonymies
shown in Table
The application of the algorithm for detecting type conflicts presented the presence of a type conflict between the attribute Category of the entity FRCD and the entity Category of the database EUCD. Finally, by applying object cluster similarities
lt.
in Section 2 derives Firm in the database
to the five ICGO schemes illustrated above the algorithm for detecting presented in Section 3 we have obtained the results shown in Table 2t.
5. RELATED
WORKS
In the literature other techniques have been proposed for the construction of global dictionaries describing inclusions, synonymies and other properties between scheme objects (e.g., [8]). Our methodology derives similar properties but also associates to each property a coefficient stating its plausibility. These coefficients allow each application exploiting interscheme properties (such as scheme integration and scheme abstraction) to obtain more reliable results. In the literature there are very few papers dealing with the detection of type conflicts (e.g., [4, 15, 241). In (41 a technique is proposed to solve all kinds of type conflicts, but this paper supposes type conflicts have been detected manually by human experts and, therefore, does not propose any approach for detecting them. On the contrary [24] proposes a technique for detecting type conflicts and object cluster similarities in a semi-automatic fashion; as illustrated also by various examples, techniques presented in this paper are able to detect all kinds of type conflicts derived by techniques described in [24], but in a more automatic fashion. Our approach has some similarities with that presented in [15] but: (i) [15] derives only synonymies, hyponymies and type conflicts (and not object cluster similarities); (ii) it requires a heavier intervention of the human expert since he must validate one “assumed similarity” at each step (whereas our algorithm requires the intervention of the human expert for validating derived properties only after the extraction of all of them); (iii) finally, we take advantage of the background knowledge whereas this is a lack of [15], as denoted by the authors themselves. The approach presented in this paper is complementary to that, described in [23]: indeed [23] improves the description of types by associating a semantics to each of them in order to explain the terms of application domain and their interrelations; these improvements on type description could be exploited for refining the part of our approach which detects type conflicts. In addition the approach for semantic interoperability presented in [23] assumes the existence of a shared ontology for schemes into consideration and admits that its construction is a difficult task; the part of our approach detecting nominal properties between scheme objects (already presented in [22]) can provide this ontology. In [21] a general approach for automatically deriving type conflicts is described based on an abstract fixpoint procedure. Here we provide a concrete technique which takes some ideas from [21] but takes into account also the guidelines of [22], which led to successfully derive nominal and structural properties. In addition we extend the approach of [al, 221 for deriving also object cluster similarities. The final result of this effort consists of the definition of a uniform framework for deriving all most common kinds of interscheme properties. In [20] a methodology for deriving object cluster similarities is proposed. This approach computes the similarity coefficients associated to each object cluster pair by taking into account only the similarities between entities belonging to involved clusters. The approach proposed here is much more precise because, for each cluster pair, it considers not only objects belonging to the structure of involved clusters, but also those belonging to their context. This choice fits the recent developments relative to the theory of interscheme property extraction (see [8, 121) where it was proved that properties derived without considering contexts are not completely reliable and it was stressed the importance of the attribute semantic relevance in the extraction of interscheme properties. tin this table 0~~1 indicates the object 0 of the scheme S. tin this table (C)fS] indicates the cluster C of the scheme S
A Uniform Methodology
for Extracting
Type Conflicts and Subscheme Similarities
551
6. CONCLUSIONS In this paper we have proposed a new automatic approach for deriving type conflicts and object cluster similarities in database schemes. Our method, unlike most of the previous ones, is automatic and, thus, it is well suited for large integration problems where manual approaches are difficult to be applied. In particular the approach described in this paper for type conflicts and object cluster similarities follows the same leading ideas as the methodology presented in [22] for deriving nominal properties. As a consequence, in the whole, we have obtained a uniform semi-automatic approach for deriving nominal properties, type conflicts and object cluster similarities. The method is based on considering pairs of objects having different types (resp., pairs of clusters) belonging to different schemes and on measuring their similarity. To this purpose, object (resp., cluster) structures as well as object (resp., cluster) contexts are analyzed to verify similitudes and differences. Some examples have been proposed in the paper for illustrating the suitability of our approach to effectively detect type conflicts and object cluster similarities. The effectiveness of the proposed approach has been tested over numerous application cases; in particular our main test case has been the set of Italian Central Governmental Office databases; in this paper we have shown the application of our algorithms to five of these databases. Presently we are completing the integration of this approach into a more complex system, called D.I.K.E. (Database Intensional Knowledge Extractor), that we are developing at D.E.I.S. Universita della Calabria, whose overall approach consists of the following steps: l
l
l
the enrichment of scheme description, obtained by (semi-automatically) extracting nominal and structural properties, type conflicts, object cluster similarities and assertions involving complex patterns relative to intensionally defined objects; the exploitation of derived interscheme properties for obtaining, in a semi-automatic fashion, an integrated and abstracted representation of available data encoded in a Data Repository; the exploitation of both the repository and interscheme properties, derived in the previous steps, for supporting the designer in realizing mediator-based Cooperative Information Systems or Data Warehouses over available data.
Acknowledgements The authors gratefully acknowledge the Autorith per I’Informatica nella Pubblica Amministrazione (AIPA) for kindly providing the schemes of ICGO and technical support with them. The authors gratefully thank Luigi Palopoli and D. Sacc?~ for many inspiring discussions about the arguments of the paper. Domenico Ursino is also supported by ISI-CNR.
REFERENCES [l] C. Batini, S. Castano, V. De Antonellis, M.G. Fugini, and B. Pernici. Analysis of an inventory of information Requirement Engineering Journal, l( 1):47-62 (1996). systems in the public administration. [2] C. Batini, S. Castano, M.G. Fugini, P. Naggar, M. Pellizzoni, and B. Pernici. Tecniche per l’analisi di descrizioni di processi nella pubblica amministrazione. In Atti de1 Congress0 Ann&e dell’AICA (AICA ‘g5), pp. 247-258, Cagliari, Italy. (In Italian) (1995). [3] C. Batini, S. Ceri, and S.B. Navathe. Company (1992).
Conceptual
Database
Design.
[4] C. Batini and M. Lenzerini. A methodology for data schema integration Pansactions on Software Engineering, 10(6):650-664 (1984). [5] C. Batini, M. Lenzerini, and S.B. Navathe. A comparative integration. ACM Computing Surveys, 15(4):323-364 (1986).
The Benjamin/Cummings
Publishing
in the entity relationship model. IEEE
analysis of methodologies
for database
[6] M.W. Bright, A.R. Hurson, and S. Pakzad. Automated resolution of semantic heterogeneity ACM tinsactions on Database Systems, 19(2):212-253 (1994). [7] S. Castano and V. De Antonellis. Reference conceptual architecture for re-engineering International Journal of Cooperative Information Systems, 4(2):213-235 (1995).
scheme
in multidatabases.
information
systems.
[8] S. Castano and V. De Antonellis. Semantic dictionary design for database interoperability. In Proc. of Intemational Conference on Data Engineering (ICDE’g7), pp. 43-54, Birmingham, United Kingdom. IEEE Computer Society (1997).
GIORGIO TERRACINA and DOMENICO URSINO
552 [9] S. Castano, applications.
V. De Antonellis, M.G. Fugini, and B. Pernici. Conceptual Technical Report, Politecnico di Milano, pp. 95-160 (1995).
schema analysis:
[lo]
T. Catarci and M. Lenzerini. Representing and using interschema knowledge in cooperative Journal of Intelligent and Cooperative Information Systems, 2(4):375-398 (1993).
[ll]
S. Chaudhuri and U. Dayal. An overview of data warehousing and olap technology. 26(1):65-74 (1997).
[12] P. Fankhauser, M. Kracker, and E.J. Neuhold. RECORD, 20(4):59-63 (1991). [13] Z. Galil. (1986).
Efficient algorithms
Techniques
and
information systems.
ACM SIGMOD
RECORD,
Semantic vs. structural resemblance of classes. ACM SIGMOD
for finding maximum
matching in graphs.
ACM
Computing
Surveys,
18:23-38
D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and [14] H. Garcia-Molina, Y. Papakonstantinou, J. Widom. The tsimmis approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 8:117-132 (1997). [15] W. Gotthard, P.C. Lockemann, and A. Neufeld. System-guided view integration for object-oriented IEEE Transactions on Knowledge and Data Engineering, 4(1):1-22 (1992). [16] J.A. Larson, S.B. Navathe, and R. Elmastri. A theory of attribute equivalence in databases to schema integration. IEEE Transactions on Softvrore Engdneering, 15(4):449-463 (1989).
databases.
with application
[17] A. Levy, A. Rajaraman, and .I. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of 22th International Conference on Very Large Data Bases (VLDB’SG), pp. 251-262, Bombay, India. Morgan Kaufmann (1996). [18] M.V. Mannino and W. Effelsberg. Matching Conference on Data Engineering (ICDE’Sd), Society (1984). [19] S.B. Navathe, R. Elmasri, 19(1):50-62 (1986).
and J.A. Larson.
techniques in global schema design. In Proc. of International pp. 418-425, Los Angeles, California, USA. IEEE Computer Integrating
user views in database
design.
IEEE
Computer,
[ZO] L. Palopoli, D. Sac&, G. Terracina, and D. Ursino. A unified graph-based framework for deriving nominal interscheme properties, type conflicts and object cluster similarities. In Proc. of Fourth IFCIS Conference on Cooperative Information Systems (CoopIS’SS), pp. 34-45, Edinburgh, United Kingdom. IEEE Computer Society (1999). [21] L. Palopoli, D. Sac& and D. Ursino. An automatic technique for detecting type conflicts in database schemes. In Proc. of ACM Conference on Information and Knowledge Management (CIKM’98), pp. 306-313, Bethesda, Maryland, USA. ACM Press (1998). techniques for deriving int,erscheme properties [22] L. Palopoli, D. Sac& and D. IJrsino. Semi-automatic database schemes. Data 0 Knowledge Engineering, 30(4):239%273 (1999). [23] E. Sciore, M. Siegel, and A. Rosenthal. Using semantic values to facilitate interoperability information systems. ACM Transactions on Database Systems, 19(2):254-290 (1994).
from
among heterogeneous
[24] S. Spaccapietra and C. Parent. View integration: A step forward in solving structural conflicts. actions on Knowledge and Data Engineering, 6(2):258-274 (1994).
IEEE
Tmns-
[25] J.D. Ullman. Information integration using logical views. In Proc. of International Conference on Database Theory (ICDT’97), pp. 19-40, Delphi, Greece. Lect,ure Notes in Computer Science, Springer-Verlag (1997). [26] J. Widom. Knowledge
Research problems in data warehousing. In Proc. of International Conference on Information Management (CIKM’98), pp. 25-30, Baltimore, Maryland, USA. ACM Press (1995).
[27] G. Wiederhold. (1992).
Mediators
in the architecture
of future information
systems.
IEEE
Computer,
and
25(3):38-49