Accepted manuscript. To appear in: Web Semantics: Science, Services and Agents on the World Wide Web. DOI: http://dx.doi.org/10.1016/j.websem.2016.09.001. Received 24 September 2015; revised 11 July 2016; accepted 12 September 2016.

CroMatcher: an Ontology Matching System Based on Automated Weighted Aggregation and Iterative Final Alignment

Marko Gulić (a,*), Boris Vrdoljak (b), Marko Banek (b,c,1)

(a) University of Rijeka, Faculty of Maritime Studies, Studentska 2, HR-51000 Rijeka, Croatia
(b) University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, HR-10000 Zagreb, Croatia
(c) Ericsson Nikola Tesla d.d., Krapinska 45, HR-10000 Zagreb, Croatia

Abstract

In order to perform ontology matching with high accuracy, while at the same time retaining applicability to most diverse input ontologies, the matching process generally incorporates multiple methods. Each of these methods is aimed at a particular ontology component, such as annotations, structure, properties or instances. Adequately combining these methods is one of the greatest challenges in designing an ontology matching system. In a parallel composition of basic matchers, the ability to dynamically set the weights of the basic matchers in the final output, thus making the weights optimal for the given input, is the key breakthrough for obtaining first-rate matching performance. In this paper we present CroMatcher, an ontology matching system, introducing several novelties to the automated weight calculation process. We apply substitute values for matchers that are inapplicable for the particular case and use thresholds to eliminate low-probability alignment candidates. We compare the alignments produced by the matchers and give less weight to the matchers producing mutually similar alignments, whereas more weight is given to those matchers whose alignment is distinct and rather unique. We also present a new, iterative method for producing one-to-one final alignment of ontology structures, which is a significant enhancement of similar non-iterative methods proposed in the literature. CroMatcher has been evaluated against other state-of-the-art matching systems at the OAEI evaluation contest. In a large number of test cases it achieved the highest score, which puts it among the state-of-the-art leaders.

Keywords: ontology matching, ontology matching system, parallel composition, automated weighted aggregation, ontology alignment

* Corresponding author. Fax: +385 51 336 755. Email addresses: [email protected] (Marko Gulić), [email protected] (Boris Vrdoljak), [email protected], [email protected] (Marko Banek)
1 Presently at Ericsson Nikola Tesla; the research was done while working at the University of Zagreb.


1. Introduction


The amount of available data has increased rapidly due to advances in information and communications technology. Consequently, new data sources describing the same domain of interest have been emerging constantly, yet being designed independently of each other and thus mutually heterogeneous. At some later point, such heterogeneous data sources describing the same domain of interest frequently need to be coupled. An ontology enriches the knowledge on a data source by providing a detailed description of entities and their mutual relations within the domain of interest. Thus, the use of ontologies facilitates the integration of heterogeneous data sources that belong to the same domain. In computer science and information science, an ontology is a formal, explicit specification of a shared conceptualization [1]. Ontology matching is the process of finding correspondences between entities of different ontologies [2]. Ontology matching is a key issue in the process of integrating heterogeneous data sources described by ontologies: if the ontology matching process is accomplished successfully, the management of data coming from different sources becomes much easier [2]. In order to automate the ontology matching process, an ontology matching system has to be developed [2].

The need for performing ontology matching at the highest possible level of quality will be demonstrated through an example in the domain of data integration. A typical case of data integration that includes coupling different data sources is the integration of selling catalogs for business-to-business (B2B) transactions in e-business [2]. A typical e-business participant owns a website that includes a catalog with the features of each product. When another e-business participant wants its products to be sold on the website of the former, their catalogs need to be harmonized. A catalog matching process (i.e. a process of finding correspondences within the catalogs) has to be performed in order to overcome the heterogeneity between the catalogs. In order to automate the matching process as much as possible, it would be highly beneficial to describe each catalog by an ontology, which is designed to capture semantics and offers a high degree of formality. With the employment of ontologies, the matching software could better understand the data within the catalogs and thus perform the matching better. Overcoming the semantic heterogeneity among the catalog data turns into overcoming heterogeneity among ontology entities.

Accordingly, in order to automate the process of matching heterogeneous data sources by using ontologies and ontology matching, the ontology matching process itself also needs to be automated to the greatest possible extent. Thus, the objective of the research presented in this paper is to enhance the degree of automation in the ontology matching process, while still being mindful to retain (i.e. not to decrease) the high quality of the proposed correspondences. This will lead to an easier and more straightforward usage of matching systems, making them applicable to a broad range of users in the future. It is of key importance to automate those parts of the matching process that are not comprehensible to ordinary (i.e. non-expert) users. In this way, the users will be able to apply the system in the future without knowing the inner steps of the matching process.


Many ontology matching systems actually perform the complex matching task by applying several basic matchers, which determine the correspondences between particular entities (classes, properties, instances) of the ontologies submitted to the matching process [2]. Since each basic matcher computes correspondences using information obtained from one or more segments of the entire ontology, the most common practice is to employ multiple basic matchers in order to utilize all information held within the ontologies.

One of the fundamental problems in designing ontology matching systems is the aggregation of the correspondences produced by various basic matchers. Basic matchers are usually executed independently of each other, while the aggregated correspondence for all basic matchers is computed afterwards. When calculating the aggregated correspondence between each two entities of two different ontologies, the results from all basic matchers must be taken into account. The problem is how to determine the importance of every basic matcher [3]. Possibly, the given ontology may lack some of the components, or there might be components to which particular basic matchers are not applicable. A basic matcher that uses those components for its correspondence computation process will either produce no result, or, worse, produce a poor and questionable result, rightfully expected to have low correspondence values. We believe that the correspondences determined by other methods, which were able to perform correspondence calculations with high quality, must have a greater influence in the aggregated result. Hence, correspondences produced by each basic matcher must be accompanied by a weighting factor. Another challenge arises when the system needs to determine the final alignment, a set of correspondences between entities of compared ontologies based on the aggregated correspondences between every two entities of the compared ontologies. The process of determining the final alignment needs to be automated due to a large number of matching possibilities between the entities.

In this paper we provide three contributions. First, we propose some major enhancements to the process of aggregating the basic matchers, which we initially designed in [4, 5]. The aggregation process consists of two steps: in the first step our Autoweight method automatically determines the weighting factor for each basic matcher taking part in the aggregation, while in the second step the aggregated correspondences are computed using the weighted aggregation parameters set in the previous step. In this paper we present Autoweight++, the enhanced version of the Autoweight method described in [4, 5]. A key novelty is a new rule for selecting relevant correspondences that participate in the calculation of weighting factors. Another major enhancement is the solution to the problem of nonexistent correspondences during the aggregated correspondence calculation by means of weighted aggregation (nonexistent correspondences between two entities occur when a particular basic matcher is inapplicable, i.e. unable to produce results). A nonexistent correspondence is replaced with the average of the correspondences between the two entities obtained by other basic matchers.


As the second contribution, we introduce a new, iterative method for producing the final alignment between the compared ontologies. In each iteration, only correspondences that have the maximum value for both ontology entities (with respect to all entities from the other ontology) will be included in the final alignment.
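To make the iterative selection idea concrete, the following is a minimal sketch of the selection loop just described (the exact CroMatcher algorithm is given in Section 4.3; the data layout and the threshold parameter are our own illustrative assumptions): in every round, a pair of entities is accepted only if each entity is the other's best remaining candidate, after which both entities are removed from further consideration.

```python
def iterative_final_alignment(corr, threshold=0.5):
    """Select one-to-one correspondences iteratively (illustrative sketch).

    corr: dict mapping (entity_of_O, entity_of_O2) -> aggregated correspondence value.
    A pair is accepted in a round only if both entities are each other's best
    remaining candidate; the accepted entities are then removed.
    """
    remaining = {p: v for p, v in corr.items() if v >= threshold}
    alignment = {}
    while remaining:
        best_left, best_right = {}, {}
        for (e1, e2), v in remaining.items():
            if v > best_left.get(e1, (None, -1.0))[1]:
                best_left[e1] = (e2, v)
            if v > best_right.get(e2, (None, -1.0))[1]:
                best_right[e2] = (e1, v)
        # keep only mutually best pairs in this round
        accepted = [(e1, e2) for e1, (e2, _) in best_left.items()
                    if best_right.get(e2, (None,))[0] == e1]
        if not accepted:
            break
        for e1, e2 in accepted:
            alignment[(e1, e2)] = remaining[(e1, e2)]
        taken1 = {a for a, _ in accepted}
        taken2 = {b for _, b in accepted}
        remaining = {(e1, e2): v for (e1, e2), v in remaining.items()
                     if e1 not in taken1 and e2 not in taken2}
    return alignment

corr = {("A", "X"): 0.9, ("B", "X"): 0.8, ("B", "Y"): 0.7}
print(iterative_final_alignment(corr))   # {('A', 'X'): 0.9, ('B', 'Y'): 0.7}
```

Unlike a single selection pass, an entity whose best counterpart was taken in an earlier round can still be matched in a later round.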

Third, we designed, implemented and evaluated a software system named CroMatcher, which automatically performs all phases of the ontology matching process and includes the two methods mentioned above. CroMatcher applies nine basic matchers, which fully exploit the information contained in the submitted ontologies in the correspondence calculation process. A majority of ontology matching systems [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] apply similar basic matchers, meaning that similar measures are used for correspondence calculation on one hand, and the same ontology components on the other. Thus, the greatest challenge is to determine a combination of methods that will make the best possible use of the information contained in the ontologies for which the correspondences are calculated. We elaborated our own unique combination of basic matchers in such a way that some of them determine the correspondences from the information derived from one ontology component, whereas the others use more components at the same time. The original idea of our system architecture was presented in [5], but since then many enhancements have been made, which will be explained in detail (together with the entire software system) in Section 4.

Furthermore, we evaluated our CroMatcher system, so that the results of ontology matching produced by CroMatcher may be compared with the matching results obtained by the execution of other state-of-the-art systems. In particular, we compared the current version of our system with the CroMatcher – IJMSO system [5], which is its initial version. The comparison is based on the Benchmark biblio test set, which is part of the evaluation infrastructure managed by the Ontology Alignment Evaluation Initiative (OAEI) [19, 20]. The comparison of the achieved results demonstrates the notable quality of our prototype system, as well as the quality of the methods which constitute the system.

The paper is organized as follows. In Section 2 the basic terminology of ontology matching and matching system architecture is introduced. In Section 3 we discuss the related work. Section 4 explains in detail our ontology matching system CroMatcher: its basic matchers, the aggregation process based on our Autoweight++ method and our new final alignment calculation method. In Section 5 the evaluation of our matching system is performed considering state-of-the-art matching systems. Finally, the conclusion is given in Section 6.

2. Ontology matching


2.1. OWL

As it has already been stated, an ontology is a formal explicit specification of a shared conceptualization of a domain [1]. The term conceptualization refers to an abstract model of a real world area as comprehended by humans. Explicit specification refers to the explicit terms and definitions that describe the concepts and the relations of the abstract model.



Each ontology is expressed by using an ontology language that provides definitions of the concepts and relations in the real world domain that have to be described by the ontology. One of the most popular ontology languages is the Web Ontology Language (OWL) [21], recommended by the W3C (World Wide Web Consortium) [22] as an international standard for ontology representation. The test ontologies used for the matching system evaluation organized by OAEI are also encoded in OWL. Therefore, our matching system supports matching between ontologies that are expressed in OWL.

Most of the elements of an OWL ontology concern classes, properties, class instances and relations between those instances [23]. The basic components of an OWL ontology, according to [24], are listed below.

• A class defines a group of individuals that belong together, because they share some properties. Classes can be organized in a specialization hierarchy using the feature subClassOf.

• Individuals are instances of classes, and properties may be used to relate one individual to another.

• Properties can be used to state relationships between individuals (ObjectProperty) or from individuals to data values (DatatypeProperty). Properties can also be organized in a specialization hierarchy using the feature subPropertyOf.

• Every class and property has its own ID (the ID is the last part of a URI after the mark #), and can be described with annotations (label - the name of the class or property, comment - the description).

• Every property has its own domain and range. The domain of a property limits the individuals to which the property can be applied. The range of a property limits the individuals that the property may have as its value. It is possible to specify various property characteristics (TransitiveProperty, SymmetricProperty, FunctionalProperty etc.).

• OWL allows restrictions to be placed on the way that properties can be used by individuals (allValuesFrom, someValuesFrom, minCardinality, maxCardinality, cardinality).

Our ontology matching system is based on T-Box matching related to the matching of ontological classes and properties [25] (as opposed to A-Box matching related to the matching of instances). Classes and properties are often referred to as entities. In this paper, when the term entity is used, it represents either a certain class or a certain property within the ontology. The term class entity represents only a certain class within the ontology. On the other hand, when the term property entity is used in the paper, then it represents only a certain property within the ontology.
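The OWL constructs listed above can be illustrated with a small snippet. The following sketch uses the rdflib Python library to build a toy ontology fragment; the example.org namespace and all entity names are invented for illustration only and are not taken from the test ontologies.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, XSD

EX = Namespace("http://example.org/onto#")   # hypothetical namespace
g = Graph()

# A small class hierarchy: Monograph is a subclass of Publication.
g.add((EX.Publication, RDF.type, OWL.Class))
g.add((EX.Monograph, RDF.type, OWL.Class))
g.add((EX.Monograph, RDFS.subClassOf, EX.Publication))
g.add((EX.Monograph, RDFS.label, Literal("Monograph")))
g.add((EX.Monograph, RDFS.comment, Literal("A scholarly book on a single subject.")))

# A datatype property with a domain and a range.
g.add((EX.title, RDF.type, OWL.DatatypeProperty))
g.add((EX.title, RDFS.domain, EX.Publication))
g.add((EX.title, RDFS.range, XSD.string))

# An object property whose range is another class.
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.hasAuthor, RDF.type, OWL.ObjectProperty))
g.add((EX.hasAuthor, RDFS.domain, EX.Publication))
g.add((EX.hasAuthor, RDFS.range, EX.Person))

print(g.serialize(format="turtle"))
```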


2.2. Terminology

In this subsection, the basic terms of ontology matching, adopted from [2], are presented.

Definition 1 (Ontology matching). Ontology matching is the process of finding semantic relationships or correspondences between entities of different ontologies. Ontology matching is defined as a function:

A′ = f(O, O′, A, p, r)    (1)


where the alignment A′ is the matching result between two ontologies, O and O′ are the ontologies that have to be matched, p is a set of parameters in the ontology matching process, and r is a set of resources and basic matchers that are used in the ontology matching process. The alignment A is the initial set of correspondences between the two ontologies and is not always available.

Definition 2 (Correspondence). A correspondence is a probability value describing the degree of equivalence between entities of different ontologies. A correspondence is defined as:

c(ei, e′j) = n    (2)

where ei is an entity of the ontology O, e′j is an entity of the ontology O′, and n is a real number from the interval [0, 1]. The higher the correspondence, the greater the correlation between the two entities.
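Definitions 2 and 3 map naturally onto very simple data structures. The following illustrative sketch (the names are ours, not part of CroMatcher) represents a correspondence as an entity pair with a value in [0, 1] and an alignment as a set of such correspondences.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correspondence:
    """A correspondence c(e_i, e'_j) = n between two entities (Definition 2)."""
    entity1: str      # entity e_i of ontology O (e.g. its URI)
    entity2: str      # entity e'_j of ontology O'
    value: float      # n, a real number from [0, 1]

# An alignment (Definition 3) is simply a set of correspondences.
alignment = {
    Correspondence("http://example.org/o1#Book", "http://example.org/o2#Monograph", 0.90),
    Correspondence("http://example.org/o1#author", "http://example.org/o2#writer", 0.75),
}
```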


Definition 3 (Alignment). An alignment is the set of all correspondences c(ei, e′j) between entities ei of ontology O and entities e′j of ontology O′ that are found in the ontology matching process. An alignment is actually the output of an ontology matching process.

2.3. Ontology matching system

There is a large number of ontology matching systems proposed in the literature [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. In general, the matching process can be divided into three main components [2]:

• Basic matchers - An ontology matching system generally consists of several basic matchers. Each basic matcher uses information from one or more ontology components to determine the correspondence between two entities of different ontologies. Therefore, many ontology matching systems comprise several basic matchers that fully utilize all possible information from the ontologies in order to improve the process of determining correspondences between entities. Basic matchers are divided into element-level and structure-level matchers according to the fundamental classification defined in [2]. Element-level matchers determine correspondences by analyzing entities or instances of those entities individually, while structure-level matchers determine correspondences between entities by analyzing their relations with other entities or their instances.

• Compositions of basic matchers and aggregation methods for basic matchers' results - Basic matchers should be interconnected within the matching system in order to calculate correspondences between ontology entities with higher quality. The most popular compositions of the basic matchers are sequential and parallel composition. Depending on the selected composition of basic matchers, an appropriate aggregation method has to be implemented in the matching system in order to aggregate the correspondences obtained by the basic matchers. The value of the aggregated correspondence between two entities is determined by all correspondences between these two entities that are obtained by the execution of particular basic matchers.


• Methods for final alignment - Once the aggregated correspondences between all entities of the compared ontologies have been determined, a certain appropriate portion of the correspondences needs to be selected and included into the final alignment of the ontology matching system. Hence, the correspondences representing the eventually matching ontology entities should be selected from the set of all obtained correspondences between every two entity pairs of the two ontologies.

The development of ontology matching systems has accelerated during the last few years [26]. Although the systems now achieve better results, there are still a number of challenges that need to be resolved. One of the most important challenges is the selection, adjustment and combination of components (basic matchers, aggregation methods, final alignment methods, etc.) in ontology matching systems. The main contribution of this paper is a solution for this matching challenge in order to enhance the quality of results.

There is a large number of already existing basic matchers for ontology matching. When a matching system is built, a great challenge is to decide which basic matchers to choose as a part of the matching system. It is important to choose a set of basic matchers that will use all available information within the ontologies that can help in finding correspondences between those ontologies. The efficiency of a basic matcher depends on the particular ontology it is applied to. For example, basic matchers that determine correspondences between ontologies by comparing comments of entities would not determine valid correspondences if there are no comments in one of the ontologies. In order to exploit the results obtained by each basic matcher in the most beneficial way, basic matchers should be combined appropriately. Basic matchers are usually connected through a sequential-parallel composition. First, basic matchers that find correspondences based on individual entity information are executed independently in a parallel composition of matchers. The obtained results are then used to execute basic matchers that find correspondences based on ontology structure.

In a parallel composition, it is important to recognize the basic matcher that achieved the best matching results in the current matching process in order to improve the quality of the whole matching process. Accordingly, basic matchers that achieved better matching results for the ontologies submitted to the matching process are given greater importance. In other words, in the remainder of the matching process, the matching system relies on basic matchers that achieved better results in order to get the best possible final results of correspondences between entities. Identifying the quality of the results of the basic matchers must be automatic. The higher the values of the correspondences, the better the basic matcher. If the system does not automatically determine which basic matchers have achieved better results for the ontologies being compared, the user will need to employ not only her domain-expert knowledge, but also additional skills pertaining to an ontology engineer: knowledge of the entire matching process, the operating mode of basic matchers, etc. Our intent is to exclude the user from this part of the matching process and include her only in those parts of the matching system where she does not have to be an expert in ontology matching in order to use the system.


When the quality of the matching results for every basic matcher is determined, the resulting correspondences (for all basic matchers within the parallel composition) of two compared entities need to be aggregated into a single common correspondence. In this paper we present our Autoweight++ method for automatically determining weighting factors in the weighted aggregation of a parallel composition of basic matchers. Autoweight++ extends the initial version, which we proposed in [4, 5] as the Autoweight method, and will be explained in detail in Section 4.2. When the correspondences between all entities are determined, the correspondences that will be a part of the final result of the matching process (the final alignment) have to be selected. The challenges that need to be resolved when selecting the correct correspondences for the final alignment are the correspondence threshold and the number of correspondences of the same entity within the final alignment. Our new method for final alignment is presented in Section 4.3.

3. Related work

Ontology matchers usually include several basic matchers [2]. Although all ontology matching systems use similar basic matchers, differences in the implementation of these basic matchers lead to different resulting correspondences between entities of two ontologies. The selection of basic matchers is a great challenge because the set of basic matchers has to utilize all information contained within an ontology in order to obtain the best possible results. Another challenge is how to aggregate the results obtained by these matchers and how to determine the final alignment. The matching systems that achieved the best results in the evaluation on the Benchmark biblio test set at the last three OAEI contests (2013 [27], 2014 [28] and 2015 [29]) will be presented in detail in this section. A large number of ontology matcher components (basic matchers, aggregation methods, final alignment methods, etc.) used by those matching systems will be described. Together with the state-of-the-art (i.e. most recent) matching systems, the COMA++ system is also presented in this section due to its contribution to the progress in the ontology matching field. Finally, we present three self-configuring ontology matching systems.

COMA++. The COMA++ system [6] is an upgrade of the COMA system [30]. This system is the predecessor of most state-of-the-art systems. It consists of several basic matchers and uses a parallel composition of basic matchers in order to determine better results. Also, multiple aggregation methods and final alignment methods are presented. A large number of basic matchers determine the correspondence between entities by comparing entity strings. Methods such as Prefix, Suffix and Levenshtein distance [31] are applied, together with a method that uses WordNet [32] in order to detect synonyms and hypernyms within strings. This system also contains basic matchers based on ontology structure.


One matcher compares strings of all entities in the part of the ontology structure where the entities being compared are situated. The other two matchers deal with the entities related to the entities being compared by relations of subclass and subproperty. After the execution of the basic matchers, the results have to be aggregated. The authors propose four aggregation methods: Minimal value, Maximal value, Average and Weighted aggregation, among which the latter gives the best results. The Weighted aggregation method evaluates the correspondences of every basic matcher differently, considering the overall quality of the results obtained by an individual matcher. The biggest challenge is how to determine the weighting factor of an individual basic matcher, i.e. how to determine the quality of the matching results that a certain basic matcher achieved. In contrast, our Autoweight++ aggregation method automatically determines the values of the weighting factors considering the results just obtained by the basic matchers, which is an important advantage of our method.

Three methods are proposed for the final alignment: the Threshold method, which is used together with the MaxN method and the MaxDelta method. The well-known problem with the Threshold method [2] is the selection of the proper threshold value: when the threshold is too high, some useful correspondences are omitted from the final alignment; when it is too low, additional incorrect correspondences occur in the final alignment. Since the mapping relationship between entities of different ontologies is usually in a 1:1 ratio, it would be good to restrict the number of correspondences for each entity that can become part of the final alignment. Therefore, the Threshold method is usually used together with another final alignment method. The MaxN method [2] takes the N highest correspondences of each entity into the final alignment. When using MaxN where N is greater than one, the final alignment consists of many (up to N) correspondences for each entity. Therefore, it may happen that the final alignment contains a lot of false correspondences. The Max1 method is usually used in matching systems. This method selects only the highest correspondence of each entity for the final alignment. The correspondence between two entities of different ontologies is the highest correspondence only if its value is higher than the values of all correspondences in which one of the currently compared entities is involved. The deficiency of this method is that it does not check the number of the highest correspondences in which the same entity is included. Only one of these correspondences can be a part of the final alignment. Hence, some entities of one ontology that have the highest correspondence with the same entity of another ontology will not be included in any correspondence of the final alignment. Our final alignment method (Section 4.3) offers a solution to this problem, by performing the alignment iteratively. The MaxDelta method [2] selects the correspondences whose values differ by less than some given delta value from the correspondence with the highest value, and puts these correspondences in the final alignment. The problem of this method is that the final alignment can also contain more correspondences for each entity, so it can happen that there is a large number of false correspondences.
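For readers unfamiliar with these generic components, the following sketch illustrates weighted aggregation followed by a combined Threshold/Max1 selection. It is a simplified illustration of the methods discussed above, not the actual COMA++ (or CroMatcher) implementation; the weights and the threshold are arbitrary example values.

```python
def weighted_aggregation(values, weights):
    """Aggregate the correspondences produced by several basic matchers for one
    entity pair, given one weight per matcher (weights assumed to sum to 1)."""
    return sum(v * w for v, w in zip(values, weights))

def threshold_max1(corr, threshold):
    """Generic Threshold + Max1 selection: for each entity of the first ontology,
    keep only its highest correspondence, provided it exceeds the threshold."""
    best = {}
    for (e1, e2), v in corr.items():
        if v >= threshold and v > best.get(e1, ("", -1.0))[1]:
            best[e1] = (e2, v)
    return {(e1, e2): v for e1, (e2, v) in best.items()}

# Example: three matchers with weights 0.5 / 0.3 / 0.2 for one entity pair.
print(weighted_aggregation([0.9, 0.4, 0.7], [0.5, 0.3, 0.2]))   # 0.71
```

Note that the sketch deliberately reproduces the Max1 deficiency discussed above: several entities of the first ontology may still end up mapped to the same entity of the second ontology.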


YAM++. The YAM++ system [7] contains several basic matchers that the authors divide into two groups: basic matchers based on entities [33, 34, 35] and basic matchers based on ontology structure [36]. The system contains several versions of the Label matcher (which extracts the label of each entity) that use different metrics (the Jaro [37], Levenshtein [31], Smith-Waterman [38], Monge-Elkan [39], Jiang-Conrath [40] and Wu-Palmer [41] algorithms). The measures TF/IDF [42] and cosine similarity [43] are used for determining correspondences by the Context matcher. The Context matcher extracts the annotation of a certain entity as well as the annotations of its descendants and ancestors. Although the Context matcher uses structure information to determine the correspondences, the authors classify it as an element-level matcher because it determines correspondences using string matching techniques like the other element-level matchers. The authors regard only modified similarity flooding [36], which determines the correspondences between entities through the propagation of correspondence results (obtained by element-level matchers) within the ontology structure, as a structure-level basic matcher. Weighted aggregation is proposed as the aggregation method, but the algorithm for setting the weighting factors is not explained. Therefore, it is assumed that the authors use their experience from previous testing of the YAM++ system in order to determine the weighting factors. The Hungarian method [44] is used to determine the final alignment. The Hungarian method finds the maximum sum of correspondences in such a way that only one correspondence of each entity in both ontologies is included in the final alignment (one entity in the first ontology is related to one entity in the second ontology and vice versa). A deficiency of the Hungarian method lies in the fact that correspondences with the highest values from the perspective of particular entities could be left out of the final alignment, because the sum of correspondences needs to be maximal. For example, the correspondences c(e2, e′1) = 0.86 and c(e1, e′2) = 0.86 will be included in the final alignment instead of the correspondences c(e1, e′1) = 0.90 and c(e2, e′2) = 0.80 (ei is an entity of the ontology O and e′j is an entity of the ontology O′).

CIDER-CL. The CIDER-CL system [8] consists of ten basic matchers that determine correspondences between entities by comparing different ontology components: the annotations of entities (label and comment), equivalent entities, super-entities, sub-entities, properties, property domain and property range. In every basic matcher of the CIDER-CL system, the strings obtained from the previously mentioned components are compared in order to determine correspondences between entities of different ontologies. The measure Soft TF/IDF [45] is used for determining similarities between two lists of word tokens. This measure is a combination of TF/IDF [42] and the Jaro distance [37]. The CIDER-CL system uses a neural network [46] to determine the aggregated correspondences. The Max1 method [30] combined with a defined correspondence threshold is used to determine the final alignment. In the final alignment process, the Max1 method selects the highest-value correspondence for each entity that will be a part of the final alignment. A problem can occur if two entities in the first ontology have the highest correspondence with the same entity from the second ontology. This problem has already been described in detail while presenting the COMA++ system.


IAMA. The IAMA system [9] consists of four element-level basic matchers. The system uses the Threshold method for selecting correspondences into the final alignment. All basic matchers use the Levenshtein distance method [31]. These basic matchers are: the Name matcher (compares URIs of entities), the Label matcher (compares labels of entities), the Comment matcher (compares comments of entities) and the Individual matcher (compares instances of entities). The results of the first three basic matchers are aggregated by weighted aggregation, where the weighting factors have been heuristically determined based on the authors' experience (the values of the factors are always the same). We believe that the weighting factors of a particular basic matcher should be determined considering the content of the compared ontologies. For example, if the matcher that determines the correspondence between entities by comparing entity comments has the highest weighting factor, and the comments are not defined in one of the matched ontologies, the system will not achieve good results. The aggregated result of the first three basic matchers and the result of the fourth matcher are aggregated by the Maximal value aggregation method. For the final alignment method, the system uses the Threshold method.

Lily. The Lily system [18] contains several basic matchers that the authors divide into two groups: text basic matchers and basic matchers based on ontology structure. Before the ontology matching process, all information about ontology entities is obtained with the semantic subgraph method [47]. The semantic subgraph method was proposed following the basic notion that the meaning of an ontology element (a graph node) can be inferred by studying the related ontology elements (other graph nodes connected to the target node by edges). The semantic subgraph method produces a semantic graph for every entity within the ontology. The semantic subgraph of a certain entity contains information about the defined relations between this entity and other entities in the ontology. An extracting algorithm based on the electrical circuit model [47, 48] is proposed for extracting information from semantic subgraphs. The Lily system contains three text-based basic matchers that use the Semantic Description Document (SDD) method [47, 49], Levenshtein distance [31] and Edit distance [2] to determine the correspondence between entities. The outputs of the first three basic matchers are aggregated by weighted aggregation. The authors did not explain how the weighting factors are determined. Although the Semantic Description Document method uses structure information to determine correspondences between entities, the authors identify this matcher as a text matcher because it applies string matching techniques like the other text basic matchers. The system also contains one structure-based basic matcher [50], which is a derivation of the similarity flooding algorithm [36]. It has more strict propagation conditions, which leads to a more efficient matching process and better alignments. During the calculation of the final alignment, the Lily system utilizes a method named ontology mapping debugging [51] to improve the final alignment results. This method resolves the problem of redundant, imprecise (for cases when the algorithm has not found the best correspondence but only an approximate correspondence), inconsistent (disobeying the axioms, typically the equivalentClass and disjointWith axioms) and abnormal (if two entities of ontology O are close in the ontology structure, but they are mapped to two entities in the ontology O′ which are themselves far away from each other) correspondences within the alignment.


ODGOMS. The ODGOMS system [10] consists of nine basic matchers executed independently of each other. The element-level basic matchers use the Longest common subsequence method (LCS) [52], the String metric for ontology alignment (SMOA) [53], the TF/IDF measure, and the cosine similarity method [43] to determine correspondences between entities. Basic matchers based on ontology structure determine the correspondence between class entities by comparing the correspondences of their property entities. When the correspondence between the property entities of the compared classes is high, then the correspondence between these class entities is also high. The correspondence value is calculated in such a way that fifty percent of the correspondence between the two class entities is summed with fifty percent of the correspondence between the property entities of the current class entities. Certainly, these methods would produce better results if the correspondences were determined iteratively. For the final alignment, the authors propose the Threshold method. A deficiency of this system is the fact that an aggregation method is not implemented. The system selects only one correspondence (one result) between two entities of different ontologies among the nine results between these entities obtained by the nine basic matchers. The authors arranged the basic matchers according to their heuristically determined importance, which is not explained in the work. If the correspondence between two entities from the set of results obtained by the most important matcher satisfies the criteria for the final alignment, it is selected for the final alignment. If the correspondence does not satisfy the criteria, the process checks the results for the two entities obtained by the second most important matcher, and so on. Once a satisfying correspondence has been found, the correspondence values between these two entities obtained by the remaining matchers are excluded from further consideration. It is possible that absolutely false correspondences between two entities enter the final alignment if the values calculated by the matchers considered the most important strongly differ from the real values. Since overlapping of entities forming the correspondences is not allowed, high-quality correspondences (i.e. those corresponding to reality) obtained by basic matchers that are not highly ranked in the ODGOMS system will not be included in the final alignment.

WikiMatch. The WikiMatch system [11] contains only three basic matchers that determine correspondences between entities by comparing the URIs, labels and comments of these entities. This system uses an external resource, Wikipedia [54], by which the correspondence between the URIs, labels, and comments of the compared entities is determined. The first basic matcher extracts information from the URI of each entity and uses Wikipedia's search function to retrieve a set of articles related to each URI. The more the sets of articles coincide, the better the correspondence value of the compared entities.


The same principle is applied to the basic matchers that extract information from labels and comments. The Maximal value method is used to aggregate the obtained results. Although WikiMatch is among the seven best matching systems in the OAEI evaluation of the Benchmark biblio test set (which is presented in detail in Section 5), the results achieved by this system are significantly lower than the results achieved by the top three systems for this test.

ECOMatch. Many matching systems always apply the same, fixed weights for their multiple basic matchers. ECOMatch [55] performs self-configuration of its own matching parameters based on the input ontology. In this method, the user first manually determines a certain number of correct correspondences. With the basic assumption that these correspondences are correct, the ontology matching system performs the matching process multiple times, each time with a different parameter configuration. The configuration that achieves the best results for the input correspondence set is subsequently applied to determine the entire set of correspondences between the compared ontologies. ECOMatch has two disadvantages. First, a large portion of correspondences must be defined manually (at least 15% according to the conducted test for the case when the compared ontologies contain more than 1000 entities). Second, a kind of brute-force approach is applied, simply testing various parameter combinations, which leads to a rapid increase of execution time.

eTuner. eTuner [56] is a tool that automatically tunes up the parameters of a matching system. In order to determine correspondences between entities of a source schema S and a target schema T, eTuner creates schemas U and V, which are structurally identical to S. Instances in S are split into two approximately equal parts, one of them being assigned to U, the other to V. Structural changes and data perturbations are performed in V to get a schema fairly different from U. eTuner then performs matching between U and V and applies the parameters that have achieved the best results when comparing U and V for comparing S and T as well. The biggest drawback of this approach is that the parameters are calculated by analyzing only the source schema S, while excluding the target schema T. Furthermore, the evaluation showed that almost thirty minutes are spent on tuning up the system for determining correspondences of a schema S with two class entities and 30 property entities.

Self-configuring system by Peukert et al. Another self-configuring matching system is described in [57]. The self-configuring part of this system is based on analyzing the input data as well as intermediate matching results obtained by different components within the matching system: basic matchers, aggregation methods and final alignment methods. In a preprocessing step before starting the matching process, the system decides which basic matchers are going to be used in the matching process. For example, the entities within the input ontologies are analyzed to check whether they contain meaningful labels, and thus the system decides whether or not to use basic matchers that determine correspondences based on entity labels during the matching process.


After the execution of the selected basic matchers, the system evaluates the obtained results, excluding the basic matchers that did not produce good results despite being selected in the previous step. In the same manner, the system chooses which of the available aggregation methods (Minimal/Maximal value [6], Average [6] and the Harmony weighted aggregation method [3]) and final alignment methods (Threshold, MaxN and MaxDelta [6]) will be applied. Among a large set of matching features and rules, the most similar to our approach is the Noise feature, which analyzes the correspondence matrix structure. The system achieves good results with respect to the OAEI 2010 evaluation, but the results again depend mostly on the available matchers and methods within the system. For example, if none of the aggregation methods can significantly improve the matching results, or the collection of basic matchers does not use all the information available within the compared ontologies, the matching result will not be the best possible.

4. Proposed ontology matching approach

As mentioned earlier, our prototype leans on the architecture of sequential-parallel composition proposed in [5], but major enhancements and changes were made in the meantime with respect to that previous solution. The system architecture has been changed as a result of applying an additional aggregation method for aggregating the results obtained by structure-based and string-based basic matchers. New basic matchers have been introduced (Profile matcher, Instance matcher, Additional instance matcher, Constraint matcher, Domain matcher, Range matcher) as well as a new method for the final alignment. Besides, the method for automated calculation of weighting factors for basic matcher aggregation has been enhanced to its present version, Autoweight++. All components of the system will be presented in the remainder of this section.

4.1. Basic matchers

Our system consists of basic matchers that work with strings describing ontology entities, as well as of basic matchers dealing with how simpler ontology entities are comprised into a more complex structure. First, a parallel composition of string-based basic matchers is performed, and the aggregated results are then used to perform a parallel composition of basic matchers based on ontology structure (from this point on, we will use the simpler terms string matcher and structure matcher). As stated before, the classification of basic matchers that is most common in the literature divides matchers into element-based and structure-based [2], which takes ontology structure as the main perspective. However, in our work, we take a different viewpoint, more oriented to the task of constructing ontology matching systems. We divide basic matchers into string-based and structure-based matchers, where the former apply a set of formulas to strings and may be applied immediately at the beginning of the matching process, while the latter require the output of the former. The strings of compared entities in string-based matchers may contain information about entities, instances of those entities and information about adjacent entities in the ontology structure. Structure-based matchers use the results obtained by string-based matchers to determine the correspondences between entities through the propagation of those correspondence results within the ontology structure. The same classification principle is also used in [18, 33, 34, 35].


4.1.1. Basic matchers based on entity strings

String matchers compare character arrays (strings) pertaining to particular entities based on the notion that the correspondence of those strings (annotations or IDs) is correlated to the correspondence of those entities in general. Some of the basic matchers use only data at the level of one particular entity, whereas others combine the data both on the single-entity and the structure level. The string matchers are listed below (a brief comparative summary is given in Table 1).

Annotation matcher. This basic matcher finds correspondences between strings obtained from IDs (formally defined by RDF/OWL structures rdf:ID or rdf:about) and annotations (which are labels and comments, defined by RDF/OWL structures rdfs:label and rdfs:comment, respectively) of two ontology entities (either classes or properties) using bigram (i.e. 2-gram) similarity [2]. The n-gram similarity measure [2] defines the value of correspondence by comparing the existing common sub-strings of length n within the strings that represent the entities, as defined by the expression

δ(s, t) = |ngram(s, n) ∩ ngram(t, n)| / (min(|s|, |t|) − n + 1)    (3)

where |ngram(s, n) ∩ ngram(t, n)| is the total number of common substrings of length n within the input strings s and t, min(|s|, |t|) is the length (i.e. number of characters) of the shorter string, while n is the length of substrings compared within s and t given at the input. String s represents the first entity, whereas t represents the second entity. The larger the number of common substrings of size n within s and t, the more similar their corresponding entities.
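Read literally, Equation (3) with n = 2 can be implemented in a few lines. The sketch below only illustrates the bigram measure used by the Annotation matcher; common n-grams are counted as a set here, which is one possible reading of the intersection in Equation (3).

```python
def ngram_similarity(s, t, n=2):
    """Bigram (n-gram) similarity of two strings, following Equation (3):
    the number of common n-grams divided by min(|s|, |t|) - n + 1."""
    def ngrams(x):
        return {x[i:i + n] for i in range(len(x) - n + 1)}
    denom = min(len(s), len(t)) - n + 1
    if denom <= 0:
        return 0.0
    return len(ngrams(s) & ngrams(t)) / denom

print(ngram_similarity("article", "artikel"))   # shares 'ar', 'rt', 'ti' -> 0.5
```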

Profile matcher. This basic matcher finds the correspondence between the profiles of two input entities (classes or properties) using the TF/IDF measure [42] and cosine similarity [43]. The profile of a class entity is a document that contains the text values of the own annotations of that class as well as the annotations of all its descendant classes, because the descendant classes inherit all features of the current class, as opposed to the ancestor classes, which usually have more general features than the current class. Moreover, the profile contains the annotation text values of all properties whose domain is the particular class. The profile of a property entity is a document that contains the text values of the own annotations of the property and the annotations of all its descendant properties. The principle behind this basic matcher is to determine the key terms within those documents. The more similar the key terms of the documents, the higher the value of entity correspondence. More details on the TF/IDF measure and cosine similarity are given in [58].

Table 1: Overview of the basic matchers in the CroMatcher system that are based on entity strings

Annotation matcher
  Ontology components used: ID (the part of the URI after #); annotations (label, comment)
  Measure: n-gram similarity

Profile matcher
  Class: annotations (label, comment); subclass annotations; property annotations
  Property: annotations (label, comment); subproperty annotations
  Measure: TF/IDF measure and cosine similarity

Instance matcher
  Class: instance values; subclass instance values
  Property: instance values of the property range (class instance values or data values)
  Measure: TF/IDF measure and cosine similarity

Additional instance matcher
  Class: instance values; superclass instance values
  Property: instance values of the property range (class instance values or data values) or the property domain (class instance values)
  Measure: TF/IDF measure and cosine similarity

Constraint matcher
  Class: number of (object and datatype) properties; number of equal property cardinalities (min, max, exact) within class properties; number of parents and children
  Property: number of domain classes; number of sub- and superproperties
  Measure: an average of the correspondence values obtained by comparing various ontology constraint constructors; generally a ratio of the smaller and the larger of the two numbers (cardinalities) corresponding to the two entities in question
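As an illustration of how the TF/IDF measure and cosine similarity are combined by the Profile matcher above (and by the Instance and Additional instance matchers below), the following toy sketch compares two "profile documents". It is a deliberately simplified TF/IDF variant, not the implementation used in CroMatcher (see [58] for the standard definitions).

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Very small TF/IDF implementation: documents are lists of tokens."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    # "+1" keeps terms shared by all documents from being zeroed out in this toy setup
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in documents]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Two toy profile documents built from labels, comments and descendant annotations.
profile_o1 = "book title author publisher isbn".split()
profile_o2 = "monograph title author publisher".split()
v1, v2 = tfidf_vectors([profile_o1, profile_o2])
print(round(cosine(v1, v2), 2))   # roughly 0.42 for these toy profiles
```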


Instance matcher. This basic matcher calculates the correspondence between instances of two entities (class instances or property instances) using the TF/IDF measure and the cosine similarity formula. A class entity instance comprises the text values (strings) of all instances of the particular class as well as the instances of all classes that are descendants (at any level) of the particular class. A property entity instance comprises the text values of all instances belonging to classes that are defined as the property range.

Additional instance matcher. This basic matcher calculates the correspondence between additional instances of two entities (class instances or property instances) using the TF/IDF measure and the cosine similarity formula. An additional instance of a class entity comprises the text values of all instances of the particular class as well as the instances of all classes that are parents (superclasses) of the particular class. An additional instance of a property entity comprises the text values of the instances of all classes defined either as the domain or the range of the particular property.

Constraint matcher. This basic matcher calculates the correspondence between several key features of two entities (classes or properties). The key features of a class entity are the number of its object and data properties, the number of cardinality constraints (minimum cardinality, maximum cardinality and exact cardinality) on its properties and the number of its direct superclasses (parent classes) as well as its direct subclasses (children classes). The key features of a property entity are the number of entities defined as the domain of the particular property, as well as the number of its parent and children properties. The higher the similarity between the features, the larger the correspondence between the considered entities. The value of the correspondence between two classes c and c′ is determined as:

δ(c, c′) = (δobject + δdata + δcard + δpar + δch) / 5    (4)

where δobject is the ratio of the smaller and the larger number of object properties for the two classes c and c′; δdata is the ratio of the smaller and the larger number of data properties that belong to the considered classes; δpar and δch are the ratios of the smaller and the larger number of classes that are parents and children, respectively, of the considered classes. Finally, δcard is determined as:

δcard = (δmin + δmax + δexact) / 3    (5)

The symbol δmin corresponds to the ratio of the number of cross-entity minimum cardinality constraint pairs with equal value and the larger of the two total amounts of minimum cardinality constraints. Likewise, δmax is the ratio of the number of cross-entity maximum cardinality constraint pairs with equal value and the larger of the two total amounts of maximum cardinality constraints, whereas δexact is the ratio of the number of cross-entity exact cardinality constraint pairs with equal value and the larger of the two total amounts of exact cardinality constraints.


The value of the correspondence between two properties p and p′ is determined as:

δ(p, p′) = (δdomain + δsuper + δsub) / 3    (6)

The symbol δdomain pertains to the ratio of the smaller and the larger total amount of classes defined as the domain of the considered properties. For instance, if the property p has 3 classes defined as its domain, whereas the property p′ has 2 classes defined as its domain, then the value of δdomain is 2/3. The symbol δsuper corresponds to the ratio of the smaller and the larger number of parents (i.e. direct super-properties) of the two considered properties, while δsub is the ratio of the smaller and the larger number of children (i.e. direct sub-properties) of the considered properties.
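Equations (4)–(6) translate directly into code. The following sketch only illustrates the Constraint matcher computation; the dictionary keys describing the feature counts are invented for the example, and the pairing of equal cardinality values is implemented as a multiset intersection, which is one possible reading of the definition above.

```python
from collections import Counter

def ratio(a, b):
    """Ratio of the smaller and the larger of two non-negative counts."""
    return 1.0 if max(a, b) == 0 else min(a, b) / max(a, b)

def equal_pairs_ratio(values1, values2):
    """Number of cross-entity constraint pairs with equal value, divided by the
    larger of the two totals (used for delta_min, delta_max and delta_exact)."""
    common = sum((Counter(values1) & Counter(values2)).values())
    return 1.0 if max(len(values1), len(values2)) == 0 else common / max(len(values1), len(values2))

def class_similarity(c1, c2):
    """Equation (4): the average of the five feature ratios of two classes."""
    d_card = (equal_pairs_ratio(c1["min_card"], c2["min_card"])
              + equal_pairs_ratio(c1["max_card"], c2["max_card"])
              + equal_pairs_ratio(c1["exact_card"], c2["exact_card"])) / 3   # Equation (5)
    return (ratio(c1["n_object_props"], c2["n_object_props"])
            + ratio(c1["n_data_props"], c2["n_data_props"])
            + d_card
            + ratio(c1["n_parents"], c2["n_parents"])
            + ratio(c1["n_children"], c2["n_children"])) / 5

def property_similarity(p1, p2):
    """Equation (6): the average of the three feature ratios of two properties."""
    return (ratio(p1["n_domain_classes"], p2["n_domain_classes"])
            + ratio(p1["n_super"], p2["n_super"])
            + ratio(p1["n_sub"], p2["n_sub"])) / 3

c1 = {"n_object_props": 4, "n_data_props": 3, "n_parents": 1, "n_children": 2,
      "min_card": [1, 1], "max_card": [1], "exact_card": []}
c2 = {"n_object_props": 3, "n_data_props": 3, "n_parents": 1, "n_children": 4,
      "min_card": [1], "max_card": [1, 2], "exact_card": []}
print(round(class_similarity(c1, c2), 2))   # roughly 0.78 for these made-up counts
```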


Table 2: Overview of the basic matchers in the CroMatcher system that are based on entity structure

SuperEntity matcher: compares the super-entities of the currently compared entities; the correspondence value is calculated iteratively from 50% of the current correspondence value and 50% of the structure correspondence value.

SubEntity matcher: compares the sub-entities of the currently compared entities; the correspondence value is calculated iteratively from 50% of the current correspondence value and 50% of the structure correspondence value.

Domain matcher: for classes, compares the properties that have the current class defined as their domain; for properties, compares the class (or a union of classes) defined as the domain of the property; the result is the correspondence value between the entities related to the property domain.

Range matcher: for properties, compares the class (or a union of classes) defined as the range of the property; the result is the correspondence value between the entities related to the property range.


The procedure must be repeated iteratively because a single execution will most probably not be able to determine the correspondences between the entities correctly. In the first iteration, new correspondences between the target entities in the two graphs (where ontology entities are nodes) are calculated based on the entities (i.e. their current correspondences) connected with the target entities in those graphs (the edges represent particular relations, in this particular case the parent-child relation). Thus, the initial correspondence between the entities in question is either increased or decreased (depending on whether the correspondence between the neighboring entities within the graph is greater or smaller). After the change of values at the end of the first iteration, the second iteration follows, in which the correspondences of the entities in question, determined in the first iteration, are changed according to the new correspondences (i.e. those obtained in the first iteration) of their neighbors. The value of the correspondence between the neighbors is also changed, because the correspondence between all their neighboring entities changes as well. Consequently, the structure-based calculation procedure must be executed iteratively until particular correspondences become (almost) constant. The method is similar to the similarity flooding algorithm [36]. However, similarity flooding uses many ontology components at the same time (subclass, subproperty, domain, range) to compute the correspondence. In contrast, we decided to create four different structure matchers, each of them using only one ontology component to perform the calculations (the remaining three matchers will be explained in the following paragraphs). Afterwards, the weighted aggregation process based on our Autoweight++ method (Section 4.2.1) produces the consolidated results, in which the basic matchers computing correspondences of higher quality prevail through their weighting factors.

We will give a simple example to illustrate how the SuperEntity matcher works. Let there be three class entities of ontology O: e1, e2 and e3, related by the subsumption relation e1 → e2 → e3 (e3 is a child of e2, while e2 is a child of e1). Let there be three class entities e′1, e′2 and e′3 of ontology O′, related by the subsumption relation e′1 → e′2 → e′3. After the execution of the basic matchers based on a single entity (i.e. the string matchers), only the correspondence c(e1, e′1) between the entities e1 and e′1 was high. The correspondence values c(e2, e′2) and c(e3, e′3) were low, due to the lack of information about those entities in O and O′. The final output of the SuperEntity matcher should determine the correspondences c(e1, e′1), c(e2, e′2) and c(e3, e′3). If we want to determine the correspondence c(e3, e′3) based on the parent entities of e3 and e′3, we will take into consideration the correspondence c(e2, e′2). Since the value of c(e2, e′2) is small as well, the value of the correspondence c(e3, e′3) will remain small. On the other hand, in order to determine the correspondence c(e2, e′2) based on the parent entities of e2 and e′2, we will take into consideration the correspondence c(e1, e′1). The value of this correspondence is high and hence the value of the correspondence c(e2, e′2) will increase after the first iteration. The new value of c(e2, e′2) will then increase the value of the correspondence c(e3, e′3) at the end of the second iteration. In this way, the matcher will finally produce all three correspondences: c(e1, e′1), c(e2, e′2) and c(e3, e′3).

The correspondence between two entities is determined by the following formula:

c(e, e′)now = 1/2 · c(e, e′)cur + 1/2 · c(e, e′)str    (7)

where c(e, e′)now is the correspondence between entities e and e′ at the end of the considered iteration, c(e, e′)str is the correspondence between entities e and e′ based on the structure of these entities, and c(e, e′)cur is the “current” correspondence between entities e and e′, i.e. the one obtained at the end of the previous iteration. The value of the structure-based correspondence between two entities is determined by the formula we first presented in [5]:

c(e, e′)str = [Σ(i=1..n) maxiter_n(c(esc, e′sc))] / max(|Esc|, |E′sc|)    (8)

The expression Σ(i=1..n) maxiter_n(c(esc, e′sc)) iteratively calculates the maximal n-addend sum of one-to-one correspondences between the super-entities of the compared entities. The symbol n is the cardinality of the smaller of the two superclass entity sets. First, the highest correspondence that occurs between any two super-entities is added to the sum. The two entities that have the highest correspondence are removed from the superclass sets, and the new highest correspondence between the remaining superclass entities is added to the sum. The process is repeated until one of the superclass entity sets becomes empty. The expression max(|Esc|, |E′sc|) is the cardinality of the larger of the two superclass entity sets.

As an illustration, we will calculate the correspondence output value of the SuperEntity matcher for two entities e and e′ with an initial correspondence of 0.65 (an aggregated result of the values produced by the string matchers). Let e have three super-entities: e1, e2 and e3. Let e′ have two super-entities: e′1 and e′2. Let the correspondences between the super-entities (before applying the SuperEntity matcher) be the following: c(e1, e′1) = 0.87, c(e1, e′2) = 0.25, c(e2, e′1) = 0.35, c(e2, e′2) = 0.87, c(e3, e′1) = 0.15, c(e3, e′2) = 0.24. The value of Σ(i=1..n) maxiter_2(c(esc, e′sc)) is 1.74: the two largest one-to-one correspondences are summed (n = 2, since the cardinality of the smaller of the two super-entity sets is 2), namely c(e1, e′1) = 0.87 and c(e2, e′2) = 0.87. Next, 1.74 is divided by max(|Esc|, |E′sc|), which is 3 (the cardinality of the larger of the two super-entity sets), and the obtained structural correspondence is c(e, e′)str = 0.58. According to Equation 7, the new correspondence is the average of the current correspondence 0.65 and c(e, e′)str, i.e. 1/2 · 0.65 + 1/2 · 0.58, which is 0.615.

The formulae shown in Equations 7 and 8 were taken from [5]. However, only the version of the matcher presented in this paper uses an iterative process (inspired by similarity flooding), and consequently produces better correspondences (the matcher in [5] was executed in a single step).
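To make Equations 7 and 8 concrete, the following minimal Java sketch (our own illustration, not the CroMatcher source code; all class and method names are ours) computes c(e, e′)str for the example above using the greedy one-to-one summation and then blends it with the current value according to Equation 7.

    // A minimal sketch of the structure-based value from Equation 8 and the
    // blending step from Equation 7, reproducing the worked example above.
    import java.util.ArrayList;
    import java.util.List;

    public class SuperEntityExample {

        // Greedily sums the n largest one-to-one correspondences between the two
        // super-entity sets (n = size of the smaller set), then divides by the
        // size of the larger set, as in Equation 8.
        static double structureCorrespondence(double[][] superCorr) {
            int rows = superCorr.length, cols = superCorr[0].length;
            int n = Math.min(rows, cols);
            List<Integer> usedRows = new ArrayList<>(), usedCols = new ArrayList<>();
            double sum = 0.0;
            for (int step = 0; step < n; step++) {
                double best = -1.0; int bestI = -1, bestJ = -1;
                for (int i = 0; i < rows; i++) {
                    if (usedRows.contains(i)) continue;
                    for (int j = 0; j < cols; j++) {
                        if (usedCols.contains(j)) continue;
                        if (superCorr[i][j] > best) { best = superCorr[i][j]; bestI = i; bestJ = j; }
                    }
                }
                sum += best;
                usedRows.add(bestI);
                usedCols.add(bestJ);
            }
            return sum / Math.max(rows, cols);
        }

        public static void main(String[] args) {
            // Correspondences between the super-entities of e (rows: e1, e2, e3)
            // and of e' (columns: e'1, e'2), taken from the worked example.
            double[][] superCorr = {
                    {0.87, 0.25},
                    {0.35, 0.87},
                    {0.15, 0.24}
            };
            double str = structureCorrespondence(superCorr);   // (0.87 + 0.87) / 3 = 0.58
            double current = 0.65;                              // string-based value of c(e, e')
            double updated = 0.5 * current + 0.5 * str;         // Equation 7
            System.out.printf("str = %.2f, updated = %.3f%n", str, updated);
        }
    }

Running the sketch prints str = 0.58 and updated = 0.615, matching the values derived above.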


SubEntity matcher. This basic matcher calculates the correspondence between two entities (classes or properties) by comparing the mutual correspondences between their children entities. If the children of the entities are similar, then the entities themselves are similar as well. Like the SuperEntity matcher, the SubEntity matcher is executed iteratively and terminates either when the value of the correspondence between the two entities converges (during the evaluation process we set the execution to stop when the average value change over all correspondences fell below 0.01) or after a certain number of iterations have been performed (we set a threshold of 30 iterations). In each iteration the new value is obtained as the sum of one half of the current correspondence and one half of the correspondence obtained by the comparison of the children entities. This basic matcher works on the same principles as the SuperEntity matcher, but the correspondences are determined by comparing the sub-entities (children entities) and not the super-entities (parent entities).

Domain matcher. This basic matcher has two versions, one for class entities and the other for property entities. Correspondences between class entities are calculated by comparing all the properties (and their ranges) that have the target classes as their domains. With respect to property entities, the comparison is performed between the classes defined as the domain of the considered properties. For the purpose of comparing the domain classes we apply the correspondences obtained by executing the string matchers. The higher the string correspondence between the domain-related entities (classes or properties, depending on the target entities), the higher the domain correspondence between the target entities themselves. The matcher works on the same principle as the SuperEntity and SubEntity matchers (i.e. using the formula displayed in Equation 8), but is not performed iteratively and therefore does not aggregate the structure-based value and the string-based value (as the SuperEntity and SubEntity matchers do, applying Equation 7), using only the structure-based value instead.

Range matcher. This basic matcher works exclusively for property entities. The correspondences are calculated by comparing the classes that are defined as the range of the properties in question. The matcher works on the same principle as the three previously presented structure matchers. Like the Domain matcher, the calculation is not performed iteratively, and the returned value entirely corresponds to the structure-based value.

The presented nine matchers calculate correspondences by exploiting information from all principal ontology components. The results of the evaluation in Section 5 will show that the system discovers the correspondences well, which also proves that the basic matchers have been chosen well.

4.2. Autoweight++, a method for automated calculation of aggregated alignment in a parallel composition of basic matchers

Each of the basic matchers produces an alignment. Since the basic matchers are joined into a parallel composition (as stated in Section 2.3), we have to perform an aggregation process, i.e. merge all the alignments into one single common alignment, which will subsequently be used to produce the final alignment.


In our matching system we use Autoweight++ (the enhanced version of the Autoweight method presented in [4, 5]) to aggregate the outputs of the basic matchers. In Autoweight++ (in comparison with the previous version of Autoweight), we solved the problem of nonexistent correspondences for certain basic matchers (due to their inapplicability) in the process of calculating the aggregated correspondences between two entities. When weighted aggregation is used in a parallel composition of basic matchers, the first task is to determine the weighting factors for each of the basic matchers. In Section 4.2.1 we will present how this task is performed by Autoweight++, with an emphasis on the enhancements to the basic Autoweight method. In Section 4.2.2 we will then present how Autoweight++ applies the obtained factors to compute the aggregated values, paying particular attention to the nonexistent correspondences.

4.2.1. Automated calculation of weighting factors for a parallel composition of basic matchers based on weighted aggregation

We introduced the Autoweight method for the first time in [4], where a part of the Harmony method [3] was adapted for the process of computing the weight of every basic matcher in a weighted aggregation. We took over the definition of the highest correspondences within the set of all correspondences (i.e. an alignment) obtained by executing a basic matcher.

Definition 4 (Highest correspondence). A correspondence between two entities ei ∈ O and e′j ∈ O′ has the quality of being the highest correspondence if and only if it has a higher confidence value (i.e. value of n in Definition 2) than any other correspondence of either ei or e′j with some other entity. A highest correspondence of ei and e′j will be shortly denoted as cmax(ei, e′j).

cmax(ei, e′j) ≡ { c(ei, e′j) = max(k∈O′) c(ei, e′k)  ∧  c(ei, e′j) = max(l∈O) c(el, e′j) }    (9)

Given one particular basic alignment P k, the isMaxCorr function defined in Equation 10 returns 1 when the correspondence of ei and e′j is highest for that particular alignment P k; otherwise, it returns 0.

isMaxCorr(i, j, P k) = 1, if c(ei, e′j)P k ≡ cmax(ei, e′j)P k; 0, otherwise    (10)
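As an illustration of Definition 4 and Equations 9 and 10, the following small Java sketch (our own code with hypothetical data; it is not part of CroMatcher) checks whether a given cell of an alignment matrix is a highest correspondence, rejecting ties in the same way as Algorithm 1 presented later.

    public class HighestCorrespondenceCheck {

        // Returns true when c(e_i, e'_j) is the strict maximum of both its row and
        // its column, i.e. a highest correspondence in the sense of Definition 4.
        static boolean isMaxCorr(double[][] p, int i, int j) {
            for (int col = 0; col < p[i].length; col++)
                if (col != j && p[i][col] >= p[i][j]) return false;
            for (int row = 0; row < p.length; row++)
                if (row != i && p[row][j] >= p[i][j]) return false;
            return true;
        }

        public static void main(String[] args) {
            // A small hypothetical alignment matrix (rows: entities of O, columns: entities of O').
            double[][] p = {
                    {0.90, 0.20, 0.10},
                    {0.30, 0.15, 0.60},
                    {0.25, 0.70, 0.65}
            };
            System.out.println(isMaxCorr(p, 0, 0)); // true: 0.90 is the maximum of its row and its column
            System.out.println(isMaxCorr(p, 1, 2)); // false: the third row holds a larger value (0.65) in that column
        }
    }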


In the Harmony method, the contribution of every pair of elements with a highest correspondence in one particular alignment (i.e. the result obtained by a particular basic matcher) is equal. In our Autoweight method, we assumed that a highest correspondence found within several basic alignments has less importance than another highest correspondence that was found within only one basic alignment P k.


Definition 5 (Importance coefficient of a highest correspondence). Given a set of n basic matchers with their alignment matrices P 1, . . . , P n, the importance coefficient (or, shortened, importance) of a highest correspondence cmax(ei, e′j) is the reciprocal of the number of its occurrences as a highest correspondence (the noHCOcc function in Equation 11; the third expression in Equation 12). Specifically, when the number of occurrences is equal to n, the value of the importance coefficient is zero (the first two expressions in Equation 12).

noHCOcc(i, j) = Σ(k=1..n) isMaxCorr(i, j, P k)    (11)

imij = 0, if noHCOcc(i, j) = 0;  imij = 0, if noHCOcc(i, j) = n;  imij = 1/noHCOcc(i, j), otherwise    (12)
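As a brief illustration (a hypothetical setting, not taken from the evaluation data): with n = 5 basic matchers, a correspondence detected as highest by exactly two of them receives the importance 1/2, one detected by a single matcher receives 1, and one detected by all five receives 0 and is thus excluded from the weight calculation.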

Evaluation tests performed in [4] demonstrated that weighted aggregation based on Autoweight (the initial version) produced the best results in comparison with other methods. Thus, the hypothesis about a different influence of particular highest correspondences in the weight calculation process was confirmed. In Autoweight++ we introduce a value threshold for highest correspondences, which additionally singles out the relevant highest correspondences that affect the weighting factor of each basic matcher. If the value of a highest correspondence is smaller than the given threshold, this highest correspondence is not included in the process of computing the weighting factors.

We will present the entire process of calculating the weighting factors with Autoweight++, based on the example illustrated in Fig. 1. The pseudocode for the weighting factor calculation algorithm embedded in Autoweight++ is given in Algorithms 1 and 2. Every instruction in the pseudocode of the algorithms is marked with its line number, which will be inserted in the text in order to relate the pseudocode and the presented example. The result of executing each basic matcher is an alignment, i.e. a correspondence matrix containing the correspondences for all entity pairs from the ontologies submitted to the matching process. Once the matrices have been obtained (Algorithm 1, pseudocode line 2), the greatest correspondences need to be determined for each row and each column of a particular matrix. Considering the example shown in Fig. 1, the greatest correspondence for a column is labeled with an O, whereas the greatest correspondence for a row is labeled with an X. Next, the highest correspondences between two entities need to be determined. As mentioned earlier, the highest correspondence between two entities is the correspondence where an entity from the first ontology has the greatest correspondence with an entity from the second ontology and vice versa. Hence, a correspondence between two entities is a highest correspondence if the labels O and X co-occur in a single cell of the matrix, if the O label is the single O label in that column, and if X is the single X label in that row.

P1        e′1    e′2    e′3    e′4    e′5
e1       0.03   0.04   0.95   0.09   0.08
e2       0.01   0.85   0.15   0.02   0.15
e3       0.08   0.05   0.20   0.93   0.05
e4       0.45   0.04   0.07   0.03   0.90

P2        e′1    e′2    e′3    e′4    e′5
e1       0.15   0.25   0.75   0.03   0.02
e2       0.01   0.55   0.10   0.02   0.10
e3       0.05   0.10   0.05   0.20   0.15
e4       0.25   0.03   0.07   0.05   0.29

P3        e′1    e′2    e′3    e′4    e′5
e1       0.10   0.09    -1    0.07   0.05
e2       0.02   0.60   0.08   0.10   0.06
e3       1.00   0.03   0.06   1.00   0.03
e4       0.35   0.04   0.03   0.03   0.50

Threshold 0.30 (O marks the highest correspondence in a column, X the highest correspondence in a row; in P3 the two equal values in the row of e3 mean that there are two highest correspondences in the same row).

Highest correspondences:
c1max: c(e1, e′3), c(e2, e′2), c(e3, e′4), c(e4, e′5)
c2max: c(e1, e′3), c(e2, e′2), c(e3, e′4), c(e4, e′5) (the last two, with values 0.20 and 0.29, are excluded as smaller than the threshold)
c3max: c(e2, e′2), c(e4, e′5)

Importance coefficients (cmax(e2, e′2) is a highest correspondence detected by all three matchers and is therefore excluded):
im22 = 1/#cmax(e2, e′2) = 1/3 (excluded), im13 = 1/#cmax(e1, e′3) = 1/2, im34 = 1/#cmax(e3, e′4) = 1, im45 = 1/#cmax(e4, e′5) = 1/2

Matcher importances and weighting factors:
im1 = im13 + im34 + im45 = 1/2 + 1 + 1/2 = 2, im2 = im13 = 0.5, im3 = im45 = 0.5
w1 = im1/(im1 + im2 + im3) = 2/3 = 0.67, w2 = im2/(im1 + im2 + im3) = 1/6 = 0.165, w3 = im3/(im1 + im2 + im3) = 1/6 = 0.165

Figure 1: An example of the Autoweight++ method for automated calculation of weighting factors for each of the basic matchers in the ontology matching process


1  procedure computeHighestCorr
   Data: P - an alignment matrix of dimension |O| · |O′| obtained by a basic matcher; thrHC - highest correspondence threshold
   Result: H - a set of highest correspondences
2  H ←− ∅; hMax ←− Matrix(|O|, |O′|); vMax ←− Matrix(|O|, |O′|);
3  for i ←− 1 to |O| do
4      for j ←− 1 to |O′| do
5          hMaxij ←− FALSE; vMaxij ←− FALSE;
6  for i ←− 1 to |O| do
7      maxVal ←− 0; maxIndex ←− 0;
8      for j ←− 1 to |O′| do
9          if pij > maxVal then
10             maxVal ←− pij; maxIndex ←− j;
11         else if pij = maxVal then
12             maxIndex ←− 0;
13     if maxIndex > 0 then hMaxi,maxIndex ←− TRUE;
14 for j ←− 1 to |O′| do
15     maxVal ←− 0; maxIndex ←− 0;
16     for i ←− 1 to |O| do
17         if pij > maxVal then
18             maxVal ←− pij; maxIndex ←− i;
19         else if pij = maxVal then
20             maxIndex ←− 0;
21     if maxIndex > 0 then vMaxmaxIndex,j ←− TRUE;
22 for i ←− 1 to |O| do
23     for j ←− 1 to |O′| do
24         if vMaxi,j and hMaxi,j and pij ≥ thrHC then
25             h.row ←− i; h.col ←− j; H ←− H ∪ {h};

Algorithm 1: Procedure computeHighestCorr, used by Autoweight++ (Algorithms 2 and 3) to calculate the weighting factors for the basic matchers, as well as by the algorithm that produces the final alignment (Algorithm 4)


In the example illustrated in Fig. 1, the greatest correspondence for the first column of the matrix P 1 (Algorithm 1, lines 14-21: maxVal = 0.45, maxIndex = 4, vMax4,1 = TRUE) does not match any greatest correspondence in a row (lines 6-13: row 1: maxVal = 0.95, maxIndex = 3, hMax1,3 = TRUE; row 2: maxVal = 0.85, maxIndex = 2, hMax2,2 = TRUE; row 3: maxVal = 0.93, maxIndex = 4, hMax3,4 = TRUE; row 4: maxVal = 0.90, maxIndex = 5, hMax4,5 = TRUE; all maxIndex values differ from 1, which represents the first column index), so that no highest correspondence exists for e′1 (line 24: for every vMaxi,1 the value is FALSE). The correspondence with the greatest value for the second column of P 1 (lines 14-21: maxVal = 0.85, maxIndex = 2, vMax2,2 = TRUE) matches the correspondence with the greatest value for the second row (lines 6-13: maxVal = 0.85, maxIndex = 2, hMax2,2 = TRUE).

1  procedure computeWeights
   Data: Pall = {P1, . . . , PN} - the set of alignment matrices of dimension |O| · |O′| (one for every basic matcher); thrHC - highest correspondence threshold
   Result: W - a vector of weighting factors for N basic matchers
2  W ←− Vector(N); H ←− Vector(N); Cnt ←− Matrix(|O|, |O′|); Im ←− Matrix(|O|, |O′|);
3  for i ←− 1 to |O| do
4      for j ←− 1 to |O′| do
5          cntij ←− 0;
6  for k ←− 1 to N do
7      Hk ←− computeHighestCorr(Pk, thrHC);
8      foreach h ∈ Hk do
9          i ←− h.row; j ←− h.col; cntij ←− cntij + 1;
10 for i ←− 1 to |O| do
11     for j ←− 1 to |O′| do
12         if cntij > 0 and cntij < N then
13             imij ←− 1/cntij
14         else imij ←− 0;
15 imsum ←− 0;
16 for k ←− 1 to N do
17     imk ←− 0;
18     foreach h ∈ Hk do
19         i ←− h.row; j ←− h.col; imk ←− imk + imij;
20     imsum ←− imsum + imk;
21 for k ←− 1 to N do
22     wk ←− imk/imsum;

Algorithm 2: Procedure computeWeights, part of the Autoweight++ method, which performs the weighting factor calculation


Therefore, there is a highest correspondence cmax(e2, e′2) between the entities e2 and e′2. Apart from cmax(e2, e′2), the matrix P 1 contains three other highest correspondences: cmax(e1, e′3), cmax(e3, e′4) and cmax(e4, e′5). The correspondence matrix P 2 contains four highest correspondences: cmax(e2, e′2), cmax(e1, e′3), cmax(e3, e′4) and cmax(e4, e′5), whereas the correspondence matrix P 3 contains only two highest correspondences: cmax(e2, e′2) and cmax(e4, e′5). The correspondence c(e3, e′4) in P 3 is not a highest correspondence since there exists another correspondence in the same row with a value equal to c(e3, e′4), namely c(e3, e′1), as noted by another X label; thus c(e3, e′4) in P 3 does not take part in the calculation process (Algorithm 1, line 11: two equal values result in maxIndex = 0; the same check is performed in line 20 for columns). The presented initial step has been taken from the Harmony method [3], since we opine that highest correspondences are the key asset for weighting factor calculation.


However, the Harmony method has certain deficiencies, which will be described in the following paragraphs, and, consequently, the remaining steps of our method differ from those of the Harmony method.

In the next step we remove all highest correspondences whose values are smaller than a given threshold. This is a new feature of Autoweight++ (in comparison with the basic Autoweight [5]). The threshold is set in order to discard unreliable highest correspondences from the process. For instance, a highest correspondence of 0.10 is certainly unreliable and it is inappropriate for the weighting factors to base their value on such a small highest correspondence between two entities. This problem was present both in Harmony and in the basic version of Autoweight [4, 5]. In the particular example in Fig. 1, the highest correspondence threshold is set to 0.30. The highest correspondences cmax(e3, e′4) and cmax(e4, e′5) in the correspondence matrix P 2 have the values 0.20 and 0.29, respectively, and are accordingly removed from the further calculation process (Algorithm 1, line 24: the condition pij ≥ thrHC is not satisfied).

Autoweight++ determines the importance of each particular highest correspondence based on how many times this correspondence has been detected as a highest one across all correspondence matrices (i.e. for all basic matchers in total, as already presented in [5]). In the example illustrated in Fig. 1, imij represents the importance of the highest correspondence between ei and e′j (i.e. cmax(ei, e′j)). The importance has to be calculated for each particular highest correspondence pair appearing across the correspondence matrices. If the highest correspondence cmax(ei, e′j) is detected as such in m correspondence matrices (Algorithm 2, lines 6-9: cntij = m), its importance imij is equal to 1/m (lines 10-14: imij = 1/cntij), which means that the importance is inversely proportional to m. Therefore, the more times a correspondence c(ei, e′j) occurs as a highest correspondence (i.e. cmax(ei, e′j)), the smaller its importance imij, because we believe that it brings less new information in comparison with a highest correspondence that has the quality of being highest in only one single correspondence matrix (the latter has the highest possible value of imij: one). In a special case, a particular correspondence may be detected as a highest correspondence in all n matrices (for all the matchers). Since we assume that in this case such a correspondence brings no useful information at all, it is excluded from the further calculation process (hence, its importance is zero, not 1/n; Algorithm 2, lines 10-14: IF cntij = 0 OR cntij = n THEN imij = 0). When a basic matcher detects only those highest correspondences that other matchers also detect as highest, that matcher should not have a large weighting factor in the aggregation process. Such a way of calculating the weighting factor for each basic matcher solves the problem detected for the Harmony method, which did not take into account the correspondences obtained by one matcher with respect to the values obtained by other matchers, but instead considered each highest correspondence equally important.

Given the example in Fig. 1, the importance im13 of the highest correspondence cmax(e1, e′3) is 1/2 (Algorithm 2, lines 10-14: im13 = 1/cnt13 = 1/2), since the correspondence c(e1, e′3) has the quality of being highest in two of the three correspondence matrices, P 1 and P 2 (lines 6-9: cnt13 = 2). The importance im22 of the highest correspondence cmax(e2, e′2) is zero (as mentioned before, it would basically be 1/3, but since it occurs as a highest correspondence in all three matrices, it is omitted from the further calculation process).


The highest correspondence cmax(e3, e′4) has the greatest importance value of 1, since it occurs as a highest correspondence only in P 1. Accordingly, the basic matcher that produces the correspondence matrix P 1 is the only one able to find the highest correspondence between e3 and e′4, while the other matchers are not able to do so. Hence, the relative importance of that highest correspondence and of the related basic matcher must be high. Once all highest correspondences have been identified and their importance calculated, we need to determine the importance of each basic matcher.

Definition 6 (Importance coefficient of a matcher). Given a set of n basic matchers with their alignment matrices P 1, . . . , P n, the importance coefficient (or, shortened, importance) of a basic matcher k is the sum of the importance coefficient values of all highest correspondences produced by that matcher.

imk = Σ(i,j: isMaxCorr(i,j,P k) = 1) imij    (13)


The importance of a matcher imk is calculated by summing the importance values of all highest correspondences produced by that matcher (as explained before, the highest correspondences have importance between zero and one). Considering the example in Fig. 1, the importance im1 of the first matcher is 2 (Algorithm 2, lines 18-19: im1 = im13 + im34 + im45 = 0.5 + 1 + 0.5 = 2), whereas the values of im2 and im3 are both 0.5. The weighting factors are obtained by normalizing those importance values.

Definition 7 (Matcher weight). Given a set of n basic matchers with their importance values im1, . . . , imn, the weight of a basic matcher k is the ratio of the importance coefficient of that particular matcher and the sum of the importance coefficients of all n matchers.

wk = imk / Σ(l=1..n) iml    (14)
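The following compact Java sketch (our own illustration with our own names, not the CroMatcher implementation) traces Definitions 5-7 end to end: it counts how often each highest correspondence occurs, converts the counts into importance coefficients, sums them per matcher and normalizes the sums into weights.

    // Each matcher contributes the list of entity-pair indices of its highest
    // correspondences that passed the threshold; pairs are encoded as "i,j" strings.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MatcherWeights {

        static double[] computeWeights(List<List<String>> highestPerMatcher) {
            int n = highestPerMatcher.size();
            Map<String, Integer> occurrences = new HashMap<>();
            for (List<String> pairs : highestPerMatcher)
                for (String pair : pairs)
                    occurrences.merge(pair, 1, Integer::sum);

            double[] importance = new double[n];
            double total = 0.0;                       // assumed > 0, i.e. at least one pair not shared by all matchers
            for (int k = 0; k < n; k++) {
                for (String pair : highestPerMatcher.get(k)) {
                    int occ = occurrences.get(pair);
                    if (occ < n) importance[k] += 1.0 / occ;   // a pair found by all matchers contributes 0
                }
                total += importance[k];
            }
            double[] weights = new double[n];
            for (int k = 0; k < n; k++) weights[k] = importance[k] / total;
            return weights;
        }

        public static void main(String[] args) {
            // The post-threshold situation of Fig. 1, pairs written as "row,col".
            List<List<String>> h = List.of(
                    List.of("2,2", "1,3", "3,4", "4,5"),   // matcher 1
                    List.of("2,2", "1,3"),                 // matcher 2
                    List.of("2,2", "4,5"));                // matcher 3
            double[] w = computeWeights(h);
            System.out.printf("w1 = %.3f, w2 = %.3f, w3 = %.3f%n", w[0], w[1], w[2]);
        }
    }

Run on the post-threshold situation of Fig. 1, it prints w1 = 0.667, w2 = 0.167 and w3 = 0.167, i.e. the weights computed for that example (up to rounding).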

In the given example, the sum of the matcher importance coefficients is 3 (line 20: imsum = im1 + im2 + im3 = 2 + 0.5 + 0.5 = 3), so the value of w1 is 0.670 (lines 21-22: w1 = im1/imsum = 2/3 = 0.67), while the values of w2 and w3 are both 0.165. At this moment, the remaining task for Autoweight++ is to perform weighted aggregation based on the calculated weighting factors and produce a common alignment (i.e. a correspondence matrix) as the parallel composition of the basic matchers.

4.2.2. Weighted aggregation in parallel composition of basic matchers as a part of Autoweight++

The Autoweight++ method solves the problem of nonexistent correspondences in the process of calculating the aggregated correspondence between two entities by means of weighted aggregation.

P1, P2 and P3 are the alignment matrices shown in Fig. 1 (in P3 the correspondence c(e1, e′3) is nonexistent, marked with -1).

Weighting factors: w1 = 0.670, w2 = 0.165, w3 = 0.165

c(e2, e′2)A = 0.67 · c(e2, e′2)1 + 0.165 · c(e2, e′2)2 + 0.165 · c(e2, e′2)3 = 0.76

c(e1, e′3)A = 0.67 · c(e1, e′3)1 + 0.165 · c(e1, e′3)2 + 0.165 · c(e1, e′3)3; since c(e1, e′3)3 = -1, it is substituted with avg[c(e1, e′3)1, c(e1, e′3)2] = (0.95 + 0.75)/2 = 0.85

PA        e′1    e′2    e′3    e′4    e′5
e1       0.06   0.08   0.90   0.08   0.07
e2       0.01   0.76   0.13   0.03   0.13
e3       0.23   0.06   0.15   0.82   0.06
e4       0.40   0.04   0.06   0.03   0.73

Figure 2: Example of aggregated alignment calculation by Autoweight++ using the previously computed weighting factors (as shown in Fig. 1)


A nonexistent correspondence between two entities is one that cannot be calculated by a particular matcher (for instance, the ontology component used by the matcher to compute the correspondence is missing). A nonexistent correspondence between two entities is replaced with the average correspondence for those two entities obtained by the other matchers. The aggregated correspondence for two entities is calculated by multiplying their correspondence in each matrix with the weighting factor of that particular matrix (calculated in the previous step; see Section 4.2.1 and Fig. 1) and summing up those products. An illustrative example of the calculation process is given in Fig. 2. The pseudocode for the entire aggregation algorithm performed by Autoweight++ is given in Algorithm 3 (calls to the procedures shown earlier in Algorithms 1 and 2 are made). The aggregated correspondence between entities e2 and e′2 is 0.76. The dominant influence of the first basic matcher can be recognized easily: the correspondence obtained by that matcher is 0.85, while its weight is 0.67. Therefore, despite the fact that the other two matchers produce significantly lower correspondences (0.55 and 0.60), the first basic matcher prevails due to its large weighting factor calculated by Autoweight++ (in comparison, the pure, i.e. non-weighted, average of the three values is 0.67). The aggregated correspondence c(e2, e′2) is thus 0.76 (Algorithm 3, line 14: w1 = 0.67; w2 = 0.165; w3 = 0.165; r22 = w1 · p1(22) + w2 · p2(22) + w3 · p3(22) = 0.67 · 0.85 + 0.165 · 0.55 + 0.165 · 0.60 = 0.76).

1  procedure AutoWeightPlusPlus
   Data: Pall = {P1, . . . , PN} - the set of alignment matrices of dimension |O| · |O′| (one for every basic matcher); thrHC - highest correspondence threshold
   Result: R - the common alignment matrix of dimension |O| · |O′| obtained by parallel composition using weighted aggregation
2  R ←− Matrix(|O|, |O′|);
3  W ←− computeWeights(Pall, thrHC);
4  for i ←− 1 to |O| do
5      for j ←− 1 to |O′| do
6          sum ←− 0; cnt ←− 0;
7          for k ←− 1 to N do
8              if pk(ij) ≥ 0 then
9                  sum ←− sum + pk(ij); cnt ←− cnt + 1;
10         avg ←− sum/cnt;
11         rij ←− 0;
12         for k ←− 1 to N do
13             if pk(ij) ≥ 0 then
14                 rij ←− rij + wk · pk(ij);
15             else rij ←− rij + wk · avg;

Algorithm 3: The entire Autoweight++ algorithm
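As a concrete companion to the pseudocode, the short Java sketch below (our own illustration, not the CroMatcher implementation) applies the same aggregation rule to the two entity pairs discussed in Fig. 2, marking nonexistent correspondences with -1.

    public class AggregationExample {

        // Weighted aggregation of one entity pair: nonexistent values (-1) are
        // replaced by the average of the values produced by the other matchers.
        static double aggregate(double[] values, double[] weights) {
            double sum = 0.0; int cnt = 0;
            for (double v : values) if (v >= 0) { sum += v; cnt++; }
            double avg = sum / cnt;                  // average of the existing correspondences
            double result = 0.0;
            for (int k = 0; k < values.length; k++)
                result += weights[k] * (values[k] >= 0 ? values[k] : avg);
            return result;
        }

        public static void main(String[] args) {
            double[] w = {0.67, 0.165, 0.165};
            // c(e2, e'2): all three matchers produced a value.
            System.out.println(aggregate(new double[]{0.85, 0.55, 0.60}, w));  // ~0.76
            // c(e1, e'3): the third matcher produced no value (-1); avg = 0.85 is substituted.
            System.out.println(aggregate(new double[]{0.95, 0.75, -1.0}, w));  // ~0.90
        }
    }

It prints approximately 0.76 for c(e2, e′2) and 0.90 for c(e1, e′3), matching the values discussed in this subsection.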

As already mentioned, particular attention must be paid to nonexistent correspondences, i.e. those for which a particular matcher could produce no output value. We take a different approach with respect to nonexistent values, in accordance with our fundamental understanding that the importance of each basic matcher is determined based on how outstanding it is in comparison with other matchers. We initially assign nonexistent values the score -1 in order to distinguish them from actually calculated correspondences of 0. For example, if one of the entities for which the correspondence is calculated has no children, the SubEntity matcher (Section 4.1.2) is not able to compare them and will consequently produce the output value -1. Regarding those correspondences as zero would significantly (and quite inappropriately, in our opinion) lower the aggregated correspondence. Instead, we provide a mechanism which ensures that the aggregated correspondence is the result of only those basic matchers that do produce an output. During the aggregation process each nonexistent correspondence is substituted with the average of the correspondences obtained by those basic matchers that could successfully produce an output result for the entities in question (i.e. all those matchers producing an output value different from -1 for the given entity pair). The example in Fig. 2 shows that basic matchers 1 and 2 produce very high correspondences for the entity pair c(e1, e′3): 0.95 and 0.75, respectively, while the third basic matcher is unable to produce any result. If the correspondence of the third matcher is considered to be 0, the aggregated correspondence will be 0.76.


If the average of the two existing correspondences, 0.85 (Algorithm 3, lines 7-11: p1(13) = 0.95; p2(13) = 0.75; p3(13) = -1; avg = (p1(13) + p2(13))/2 = 0.85), is used as the substitute for the nonexistent correspondence, the aggregated correspondence is 0.9 (lines 13-15: w1 = 0.67; w2 = 0.165; w3 = 0.165; r13 = w1 · p1(13) + w2 · p2(13) + w3 · avg13 = 0.67 · 0.95 + 0.165 · 0.75 + 0.165 · 0.85 = 0.9).

4.3. Final alignment

The method for automated computation of the final alignment, which is one of the contributions of this paper, is iterative. It is based on the highest correspondences (in accordance with Definition 4) taken from the common alignment that aggregates all nine basic matchers. A correspondence between an entity ei from ontology O and an entity e′j from ontology O′ is considered to be highest only if its value is greater than all other correspondences between ei and entities in O′, and all other correspondences between e′j and entities in O. The presented idea has been adopted from the Max1 method for calculating the final alignment [6]. In the first iteration of our algorithm (see Algorithm 4 for the pseudocode), correspondences with the quality of being highest become a part of the final alignment (the entire Max1 corresponds to this first iteration of our method). After the first iteration, entities in O (and O′, respectively) that have not been related by a highest correspondence with any entity in O′ (and O, respectively) do not enter the final alignment. In this way, a correspondence between two entities that would otherwise (taking into perspective all correspondences, i.e. all values in the final alignment matrix presented in Algorithm 4) be considered significant enough to become a part of the final alignment remains excluded from it, because one of the two entities in question had a greater correspondence with some third entity (whereas the latter correspondence was not highest for that third entity and was therefore itself excluded from the final alignment). We solve this deficiency of Max1 by performing the process iteratively. In the second iteration we take into account only those correspondences between the entities that are currently not a part of the final alignment. The procedure is repeated as long as both of the following conditions hold: first, there are unpaired entities available for the final alignment; second, the aggregated correspondences in the matrix are higher than a given final alignment threshold. Since the method of iterative inclusion of highest correspondences into the final alignment is generally rather exclusive, the threshold is rather small: after some initial experiments we limited its value to a general interval between 0.15 and 0.25. By performing additional experiments, we obtained 0.22 as the optimal threshold value.

An illustration of the process is shown in Fig. 3. Since the threshold value is 0.22 (Algorithm 4, line 1: thrFIN = 0.22), none of the correspondences with a value smaller than the threshold can participate in the final alignment. Before the first iteration, the calculation matrix is identical to the common alignment matrix obtained by aggregating the correspondences produced by all matchers in Section 4.2 (lines 2-5: rij - aggregated values; xij = rij). A correspondence exists (i.e. the matrix contains a cell) for every two members of the two ontologies.

Aggregated matrix (threshold 0.22):

          e′1    e′2    e′3    e′4    e′5    e′6    e′7    e′8    e′9
e1       0.02   0.01   0.15   0.12   0.93   0.01   0.16   0.07   0.19
e2       0.07   0.02   0.25   0.11   0.35   0.18   0.10   0.13   0.16
e3       0.02   0.55   0.03   0.15   0.17   0.13   0.25   0.10   0.07
e4       0.15   0.10   0.25   0.87   0.13   0.11   0.03   0.03   0.02
e5       0.05   0.01   0.02   0.15   0.20   0.10   0.85   0.03   0.04
e6       0.10   0.15   0.01   0.05   0.07   0.08   0.05   0.94   0.17
e7       0.01   0.09   0.06   0.07   0.03   0.12   0.50   0.73   0.72
e8       0.10   0.65   0.75   0.12   0.85   0.05   0.03   0.01   0.02

First iteration - highest correspondences: c(e1, e′5), c(e4, e′4), c(e5, e′7), c(e6, e′8)
Final alignment so far: c(e1, e′5), c(e4, e′4), c(e5, e′7), c(e6, e′8)

Second iteration - reduced matrix (rows e2, e3, e7, e8; columns e′1, e′2, e′3, e′6, e′9):

          e′1    e′2    e′3    e′6    e′9
e2       0.07   0.02   0.25   0.18   0.16
e3       0.02   0.55   0.03   0.13   0.07
e7       0.01   0.09   0.06   0.12   0.72
e8       0.10   0.65   0.75   0.05   0.02

Highest correspondences: c(e7, e′9), c(e8, e′3)
Final alignment so far: c(e1, e′5), c(e4, e′4), c(e5, e′7), c(e6, e′8), c(e7, e′9), c(e8, e′3)

Third iteration - reduced matrix (rows e2, e3; columns e′1, e′2, e′6):

          e′1    e′2    e′6
e2       0.07   0.02   0.18
e3       0.02   0.55   0.13

Highest correspondences: c(e2, e′6), c(e3, e′2); c(e2, e′6) = 0.18 is below the threshold 0.22 and is discarded
Final alignment: c(e1, e′5), c(e4, e′4), c(e5, e′7), c(e6, e′8), c(e7, e′9), c(e8, e′3), c(e3, e′2)

Figure 3: Example of calculating the final alignment


1  procedure finalAlignment
   Data: R - the common alignment of dimension |O| · |O′|; thrFIN - final alignment threshold
   Result: F - a set of correspondences comprising the final alignment
2  F ←− ∅; X ←− Matrix(|O|, |O′|);
3  for i ←− 1 to |O| do
4      for j ←− 1 to |O′| do
5          xij ←− rij;
6  H ←− computeHighestCorr(X, thrFIN);
7  while H ≠ ∅ do
8      foreach h ∈ H do
9          F ←− F ∪ {h}; r ←− h.row; c ←− h.col;
10         for i ←− 1 to |O| do
11             xic ←− 0;
12         for j ←− 1 to |O′| do
13             xrj ←− 0;
14     H ←− computeHighestCorr(X, thrFIN);

Algorithm 4: Algorithm for producing the final alignment

The task is to detect highest correspondences and include them into the final alignment. The first iteration finds four correspondences with the quality of being highest: c(e1, e′5), c(e4, e′4), c(e5, e′7) and c(e6, e′8), and all four values are larger than the threshold (lines 8-14: highest correspondences x15, x44, x57, x68; F = {c(e1, e′5), c(e4, e′4), c(e5, e′7), c(e6, e′8)}; for all rows i and columns j the values x1j, xi5, x4j, xi4, x5j, xi7, x6j and xi8 are set to 0). Thus, all of them become a part of the final alignment. The correspondence between entities e7 and e′9 is not highest and is therefore not included into the final alignment: e7 has a greater correspondence with e′8 (0.73) than with e′9 (0.72). On the other hand, e′8 has its greatest correspondence value with e6, which leaves e7 unpaired at the end of the first iteration.

In the second (and each subsequent) iteration all correspondences related to entities that have already become a part of the final alignment are cast out from the calculation matrix (i.e. the entire row or column corresponding to each of them). In this way, we are able to find new highest correspondences in the reduced matrix. In the second iteration, the newly identified highest correspondences are c(e7, e′9) and c(e8, e′3). Since their values are higher than the threshold, they are both included into the final alignment. In the third iteration, two highest correspondences are identified: c(e3, e′2) and c(e2, e′6). Only the value of the former is greater than the threshold (0.22) and thus only the former enters the final alignment. After the third iteration the mutual correspondences for all unpaired entities (e2 in O; e′1 and e′6 in O′) are smaller than the threshold (Algorithm 4, line 14: all values xij < thrFIN), which indicates the end of the algorithm. The final alignment consists of seven one-to-one correspondences: c(e1, e′5), c(e4, e′4), c(e5, e′7), c(e6, e′8), c(e7, e′9), c(e8, e′3) and c(e3, e′2).
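The following Java sketch (our own compact rendering of the iterative idea, not the CroMatcher source) shows the core of the procedure: repeatedly select the correspondences that are the strict maximum of both their row and their column and exceed the threshold, zero out the corresponding rows and columns, and stop when nothing above the threshold remains.

    import java.util.ArrayList;
    import java.util.List;

    public class IterativeFinalAlignment {

        static List<int[]> finalAlignment(double[][] r, double threshold) {
            double[][] x = new double[r.length][];
            for (int i = 0; i < r.length; i++) x[i] = r[i].clone();
            List<int[]> result = new ArrayList<>();
            boolean found = true;
            while (found) {
                found = false;
                // collect the highest correspondences of the current (reduced) matrix
                List<int[]> batch = new ArrayList<>();
                for (int i = 0; i < x.length; i++)
                    for (int j = 0; j < x[i].length; j++)
                        if (x[i][j] >= threshold && isMaxOfRowAndColumn(x, i, j)) batch.add(new int[]{i, j});
                for (int[] h : batch) {
                    result.add(h);
                    for (int j = 0; j < x[h[0]].length; j++) x[h[0]][j] = 0;  // remove the row
                    for (int i = 0; i < x.length; i++) x[i][h[1]] = 0;        // remove the column
                    found = true;
                }
            }
            return result;
        }

        static boolean isMaxOfRowAndColumn(double[][] x, int i, int j) {
            for (int c = 0; c < x[i].length; c++) if (c != j && x[i][c] >= x[i][j]) return false;
            for (int r = 0; r < x.length; r++) if (r != i && x[r][j] >= x[i][j]) return false;
            return true;
        }

        public static void main(String[] args) {
            // The aggregated matrix of Fig. 3 (rows e1..e8, columns e'1..e'9).
            double[][] r = {
                {0.02, 0.01, 0.15, 0.12, 0.93, 0.01, 0.16, 0.07, 0.19},
                {0.07, 0.02, 0.25, 0.11, 0.35, 0.18, 0.10, 0.13, 0.16},
                {0.02, 0.55, 0.03, 0.15, 0.17, 0.13, 0.25, 0.10, 0.07},
                {0.15, 0.10, 0.25, 0.87, 0.13, 0.11, 0.03, 0.03, 0.02},
                {0.05, 0.01, 0.02, 0.15, 0.20, 0.10, 0.85, 0.03, 0.04},
                {0.10, 0.15, 0.01, 0.05, 0.07, 0.08, 0.05, 0.94, 0.17},
                {0.01, 0.09, 0.06, 0.07, 0.03, 0.12, 0.50, 0.73, 0.72},
                {0.10, 0.65, 0.75, 0.12, 0.85, 0.05, 0.03, 0.01, 0.02}};
            for (int[] pair : finalAlignment(r, 0.22))
                System.out.println("e" + (pair[0] + 1) + " - e'" + (pair[1] + 1));
        }
    }

Applied to the aggregated matrix of Fig. 3 with the threshold 0.22, this sketch selects the same seven correspondences in the same three iterations.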


4.4. System architecture

Our prototype software implements all basic matchers described in Section 4.1. It also implements the Autoweight++ algorithm for weighted aggregation of the results produced by the basic matchers (Section 4.2), as well as the method for computing the final alignment, which is our iterative extension of the Max1 algorithm (Section 4.3). Fig. 4 presents the UML Activity Diagram for the implemented prototype ontology matching system. The developed prototype is one of the contributions of this paper, since it enables an evaluation of the entire method (the evaluation will be discussed in the upcoming Section 5). The prototype software performs the following activities (steps).

1. Ontology data processing. Data related to each particular entity are extracted from the ontologies. Afterwards, textual data are normalized by means of tokenization (text is divided into a set of basic terms) and by removing the stop-words (eliminating tokens that do not provide useful information).

2. Execution of string matchers. These five matchers (see Fig. 4) compare strings related to particular entities, trying to determine their mutual correspondence in this way. The matchers were presented in Section 4.1.1.

3. Weighted aggregation of string matchers using Autoweight++. After ontology data processing (step 1), the string matchers are executed in parallel (step 2) and their results are aggregated into a string-based composition alignment (according to the design principles explained in Section 2.3) using weighted aggregation according to Autoweight++ (Section 4.2). The threshold parameter for highest correspondences, which they must exceed to influence the weighting factor calculation (Section 4.2.1), is 0.20. The value was chosen as a result of our unofficial testing on certain open tests of the OAEI evaluation held in 2012 [59]. It was not chosen by analyzing the ontologies being compared, but is a constant that we set heuristically by observing the results of the OAEI open tests. In the future, we also plan to calculate the threshold using the same basic principles we applied in Autoweight++ in order to obtain even better matching results. The primary advantage of the parallel composition lies in the fact that each of the basic matchers produces an output of a different quality, as a result of how the input ontologies are written. The parallel composition is able to consolidate those outputs well.

4. Execution of structure matchers. These matchers apply the structure of the ontologies submitted to the matching process in order to determine their mutual correspondence. As mentioned before, an aggregated entity-based alignment (which is actually based on the string matchers; see Fig. 4) must already be defined in order to execute the structure-based matchers (this alignment was produced in step 3). The matchers were presented in Section 4.1.2.


5. Weighted aggregation of structure matchers using Autoweight++. After the parallel execution of the structure matchers, their results must also be aggregated into a structure-based composition alignment. Again, we use weighted aggregation according to Autoweight++.

6. Weighted aggregation of the string-based alignment and the structure-based alignment using Autoweight++. The previously created composition alignments based on the string matchers (step 3) and the structure matchers (step 5) are united into a single common alignment (see Fig. 4), which represents the final correspondence for each combination of entities from the ontologies O and O′. Again, weighted aggregation is performed according to Autoweight++.

7. Final alignment. The common alignment produced in step 6 is submitted to the process of creating the final alignment. While all previous alignments were lists of correspondences representing matching candidates, the final alignment contains only those correspondences that really represent a match, i.e. a list of the few best possible correspondences. The final alignment method was described in Section 4.3.

The prototype software has been entirely implemented in Java. The OAEI [19, 20] requires that the software submitted to their evaluation process is written in that programming language. The software takes as its input (Fig. 4) two ontologies written in OWL (O and O′), which are submitted to the matching process. The output alignment is written in the appropriate format, defined by the OAEI and described in detail in [60].

5. Evaluation

The Benchmark test set [61] is the largest test set (in terms of the number of pairs of ontologies that have to be matched) in the ontology matching system evaluation organized by the OAEI (Ontology Alignment Evaluation Initiative) [19, 20]. Benchmark, which is written in OWL, is used for the comparison of up-to-date matching systems, including our system, CroMatcher. The comparison among various matching systems presented in this section is based on the Benchmark biblio test subset, since it is the only subset test case of Benchmark that has been included in every OAEI evaluation held so far. Moreover, since the Benchmark biblio test subset has not changed since 2011, it is appropriate for comparing matching systems participating in different OAEI evaluations. Our matching system participated in the OAEI 2013 [62] and OAEI 2015 [63] evaluations, which consist of various test cases and are a part of the Ontology Matching workshop, held annually in conjunction with the International Semantic Web Conference (ISWC). The Benchmark biblio test case contains more than 100 pairs of ontologies and the alignment results between them. In each pair of ontologies, the first ontology is the reference ontology containing all the information related to a specific domain described by that pair of ontologies. In the second ontology certain information (for instance, the structure of entities) is missing in order to test the matching systems.

Figure 4: UML Activity Diagram describing the architecture of the CroMatcher system for ontology matching


Hence, a matching system must find as many relevant correspondences as possible between such an incomplete ontology and the reference ontology. Thus, the Benchmark biblio test set is the starting point for a comparison of different matching systems, because it reveals all the advantages and disadvantages of a particular matching system when matching ontologies in which certain information is missing. The Benchmark biblio test set is designed considering four main components of OWL ontologies: entity annotations, ontology structure, class properties and class individuals. In accordance with the OWL and ontology matching terminology presented in Sections 2.1 and 2.2, the term entity (in this particular case entity annotations) refers to both ontology classes and ontology properties. When a specific aspect of the ontology matching process refers to classes and properties together, the term entity is used. Otherwise the term class or the term property is used. The advantages and disadvantages of each matching system can be examined according to the missing component in a particular Benchmark biblio test set.


OAEI uses the following evaluation measures for the comparison of ontology matching systems:

• Precision measures the ratio of correctly found correspondences (#correctly found corr) over the total number of correspondences returned by the matching system (#total found corr) [2]:

Precision = #correctly found corr / #total found corr    (15)

• Recall measures the ratio of correctly found correspondences (#correctly found corr) over the total number of all correct correspondences between two ontologies (#all correct corr) [2]:

Recall = #correctly found corr / #all correct corr    (16)

• F-measure is the harmonic mean of Precision and Recall [2]:

F-measure = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall)    (17)

Depending on the parameter β, the relative importance of Precision and Recall can be controlled. Precision and Recall are usually valued equally; therefore, the value of the parameter β is set to 1.
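A small Java helper (ours, purely for illustration; the counts used below are hypothetical) shows how the three measures are computed; with β = 1 the F-measure reduces to the usual harmonic mean of Precision and Recall.

    public class Measures {

        static double precision(int correctlyFound, int totalFound) {
            return (double) correctlyFound / totalFound;
        }

        static double recall(int correctlyFound, int allCorrect) {
            return (double) correctlyFound / allCorrect;
        }

        static double fMeasure(double precision, double recall, double beta) {
            return (1 + beta * beta) * precision * recall / (beta * beta * precision + recall);
        }

        public static void main(String[] args) {
            // Hypothetical counts: 80 correct correspondences among 90 returned,
            // with 100 correct correspondences in the reference alignment.
            double p = precision(80, 90);             // about 0.889
            double r = recall(80, 100);               // 0.8
            System.out.println(fMeasure(p, r, 1.0));  // about 0.842
        }
    }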

The results for seven matching systems (together with the first version of our system, described in the IJMSO journal and thus denoted as CroMatcher – IJMSO), which achieved the best F-measure scores on the Benchmark biblio test set during the last three OAEI evaluations (2013 [27], 2014 [28] and 2015 [29]), are presented in the remainder of this section. These seven matching systems are our matching system CroMatcher, YAM++, Lily, CIDER-CL, IAMA, ODGOMS and WikiMatch. The systems that participated in OAEI 2014 are not included in the following evaluation study, because the best results achieved in OAEI 2014 against the Benchmark biblio test set (Xmap++ [64], AOT/AOTL [65], RDSL [66]) were worse than those of any of the seven aforementioned systems that took part in OAEI 2012 (WikiMatch), OAEI 2013 (YAM++, CIDER-CL, IAMA, ODGOMS) and OAEI 2015 (CroMatcher, Lily) against the same test set. The MaasMatch system [67] achieved the same result (F-measure = 0.69) in the evaluation of the Benchmark biblio test set as the WikiMatch system (WikiMatch has the poorest results among the best seven systems on the Benchmark biblio test set), but MaasMatch did not produce results for the entire Benchmark biblio test set. Therefore, we decided not to include this system in our evaluation.

As stated before, entity annotations contain most of the information about an entity in an OWL ontology. Therefore, when entity annotations are defined within the pair of testing ontologies, the evaluation results will be very high regardless of the fact that some other main component (ontology structure, properties or instances) is missing within the ontologies.


The values of Precision, Recall and F-measure for each matching system, when matching those ontologies in the Benchmark biblio test set that do contain entity annotations, are shown in Fig. 5.

Figure 5: Performance comparison when matching ontologies that contain entity annotations: Benchmark biblio tests - 221, 222, 223, 224, 225, 228, 232, 233, 236, 237, 238, 239, 240, 241, 246, 247


The value of the Precision measure for the matching systems CroMatcher, YAM++, Lily and ODGOMS is equal to 1. The other systems also have high Precision values, only slightly lower than the four aforementioned systems. The results show that all the systems select almost no false correspondences between ontologies when entity annotations are defined. The value of the Recall measure is also very high for all matching systems. Therefore, it can be concluded that the matching systems recognize the majority of the expected correct correspondences between the compared ontologies. The matching systems CroMatcher, YAM++, Lily and ODGOMS achieved the maximal value (equal to 1) for F-measure, due to the maximal values of Precision and Recall. The value of F-measure is higher than 0.9 for all tested systems, so the assumption that matching systems successfully determine correspondences between entities with defined annotations is confirmed.

In the Benchmark biblio test set there are also ontologies containing entities that do not have meaningful annotations defined (labels and comments are actually strings without meaning, e.g. “sdgshfhfs”). These test ontologies show the quality of the matching systems when the matching data have to be extracted from the ontology structure, properties and instances. The values of Precision, Recall and F-measure for each matching system are shown in Fig. 6. The Precision value is very high for every matching system (CIDER-CL has a somewhat lower value of Precision than the other matching systems). The Precision value for our matching system, CroMatcher, is slightly lower than the highest Precision value (achieved by the ODGOMS matching system). However, the results for the Recall measure must be included in order to show the overall matching quality. The ODGOMS system, which has the highest Precision value, has a quite poor value for Recall.


Figure 6: Performance comparison when matching ontologies that do not contain meaningful entity annotations: Benchmark biblio tests - 201, 201-2, 201-4, 201-6, 201-8, 202, 202-2, 202-4, 202-6, 202-8


All correspondences found by ODGOMS are correct correspondences (high Precision), but a large number of correct correspondences was not found (low Recall). Although the matching systems CroMatcher, Lily and YAM++ have slightly lower Precision values than ODGOMS, the Recall values for these three systems are very high. These three systems determine a large number of the actually existing correspondences (high Recall) and a small number of incorrect correspondences (not the best, but still high Precision), as expressed by the F-measure, for which CroMatcher and YAM++ achieved by far the best results in comparison with the other systems. In the subsequent tests, the mutual relation of the Precision and Recall values will follow the same pattern. The systems with the highest Precision will have low Recall (IAMA, ODGOMS and WikiMatch). This means that these systems apply very strict criteria for determining correct correspondences, and only those correspondences with very high values are included in the final alignment. In this particular test, as well as in the other tests in the remainder of this section, the difference between the Precision values obtained by different systems is always smaller than the difference between the Recall values. Thus, CroMatcher, Lily and YAM++, which are deliberately aimed at high Recall (at the expense of slightly lower Precision), also have a higher value of F-measure, i.e. better overall performance, than the systems aimed at very high Precision. It can be concluded that the criteria for determining relevant correspondences are not so well balanced in the systems aimed at high Precision (IAMA, ODGOMS and WikiMatch). On the other hand, CroMatcher achieved a good balance that keeps high values for both Precision and Recall. In this test, performed on ontologies without meaningful annotations defined, Lily achieved the best results (CroMatcher was the second-best, with a slightly lower F-measure). Therefore, the results obtained by the basic matchers that use the information from structure, properties and instances for matching ontologies confirmed the right selection of basic matchers in CroMatcher and also the high quality of the weighted aggregation of these matchers. Clearly, the Autoweight++ method for computing the weighting factors in the weighted aggregation, which is one of the contributions of this paper, is an important factor for the robustness of the matching system when dealing with ontologies that do not have their entity annotations defined.


method for computing the weighting factors in the weighted aggregation, which is one of the contributions of this paper, is an important factor for the robustness of the matching system, when dealing with ontologies that do not have their entity annotations defined. Our previous version of the system, CroMatcher – IJMSO achieved poorer results than CroMatcher, but it can be placed alongside IAMA, ODGOMS and WikiMatch based on the result achieved in this portion of the Benchmark biblio test set. In order to verify the results of the previous test, we apply another test set, consisting of ontologies that lack annotations (labels and comments) of entities and in addition do not have one of the other important components (properties, structure or instances) defined. Therefore, each test pair of ontologies contains ontologies that have only two main components defined: properties and structure, properties and instances or structure and instances. When matching these ontologies, the greatest challenge is to find the best procedure of aggregating results obtained by basic matchers that use information from two defined ontology components in order to achieve the best possible matching results. In Fig. 7, the results for Precision, Recall and F-measure are presented for every matching system when matching ontologies from the Benchmark biblio test set that contain neither entity annotations nor one of other important ontology components (structure, properties or instances).

Figure 7: Performance comparison when matching ontologies that contain neither entity annotations nor one of other three important ontology components (structure, properties or instances): Benchmark biblio tests - 248, 248-2, 248-4, 248-6, 248-8, 249, 249-2, 249-4, 249-6, 249-8, 250, 250-2, 250-4, 250-6, 250-8, 251, 251-2, 251-4, 251-6, 251-8
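Throughout these comparisons, Precision, Recall and F-measure follow their standard definitions over sets of correspondences. As a minimal illustration (the correspondence sets below are hypothetical and not taken from the Benchmark data), they can be computed as follows:

# Minimal sketch: Precision, Recall and F-measure over sets of correspondences.
def evaluate(found, reference):
    # true positives are the found correspondences that also appear in the reference alignment
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure

# Hypothetical reference alignment and two hypothetical system alignments.
reference = {("Book", "Volume"), ("Author", "Writer"), ("title", "name"), ("year", "date")}
cautious = {("Book", "Volume")}                                        # few, but certain
balanced = {("Book", "Volume"), ("Author", "Writer"), ("title", "name"), ("year", "publisher")}

print(evaluate(cautious, reference))   # (1.0, 0.25, 0.4)   high Precision, low Recall
print(evaluate(balanced, reference))   # (0.75, 0.75, 0.75) better overall F-measure

The second, more permissive alignment trades a little Precision for much higher Recall and therefore obtains a higher F-measure, which mirrors the behaviour of CroMatcher, Lily and YAM++ discussed above.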

As in the previous test, the value of Precision is high for each system except CIDER-CL and CroMatcher – IJMSO. CroMatcher has a slightly lower Precision value than the systems with the highest value (IAMA and WikiMatch), but the values for Recall and F-measure again show the real quality of the matching results. CroMatcher, Lily and YAM++ achieved by far the best results for both Recall and F-measure in comparison with the other matching systems. Furthermore, for both measures CroMatcher achieved a better result than Lily and YAM++. This indicates that the method for final alignment used by CroMatcher accurately determines the correct correspondences between the compared ontologies. Since the correspondences resulting from various basic
matchers had to be properly aggregated in order to achieve accurate matching results in this test, it can be concluded that the Autoweight++ method once more aggregated the obtained results efficiently.

Next, we performed tests on sets of ontologies in which entity annotations (i.e. labels and comments) are missing and only one of the other important components (properties, structure or instances) is defined, i.e. contains useful matching information. Therefore, the matching result for each test pair of ontologies is obtained by comparing only one ontology component: properties, instances or the ontology structure. These test sets show which system has the best individual basic matchers. For example, a test pair of ontologies that contains only information about the ontology structure shows which matching system has the best basic matchers that find correspondences based on structural information. There are three separate test sets, one for each main ontology component: the first consists of ontologies in which the set of instances is the only defined component, the second of ontologies in which the set of properties is the only defined component, and in the third the ontology structure is the only defined component. The results obtained when matching ontologies with defined instances are presented first; they are shown in Fig. 8.

Figure 8: Performance comparison when matching ontologies in which the instance set is the only defined ontology component: Benchmark biblio tests - 254, 254-2, 254-4, 254-6, 254-8, 260, 260-2, 260-4, 260-6, 260-8

The value of Precision is very high for all systems. Again, IAMA, ODGOMS and WikiMatch achieved the best results. However, the Recall value is very low for these systems; hence their F-measure values are not very good. The YAM++ system achieved the best result for Recall. The only systems whose Recall is close to that of YAM++ are Lily and our system, CroMatcher. YAM++, CroMatcher and Lily have the highest values of F-measure, with YAM++ achieving slightly better results. It can be concluded that the basic matchers used by YAM++, Lily and CroMatcher to find correspondences based on instances exploit this information in the best way. Our previous version of the system, CroMatcher – IJMSO, achieved poor results, which was expected, since it does not contain any basic matcher that determines correspondences based on information from the defined instances.

Next, the matching systems are tested on ontologies in which the property set is the only defined ontology component and thus contains most of the matching information. The results for Precision, Recall and F-measure are presented in Fig. 9.

Figure 9: Performance comparison when matching ontologies in which the property set is the only defined ontology component: Benchmark biblio tests - 253, 253-2, 253-4, 253-6, 253-8

The values for Precision and Recall are similar to those for the ontologies where only instances are defined. CroMatcher, Lily and YAM++ have by far the best results for the Recall measure. Moreover, in this test our system CroMatcher achieved better Recall than YAM++ and Lily. It can be seen that the systems IAMA, ODGOMS and WikiMatch again achieved low Recall, indicating that these systems detect only those entity correspondences that have a very high value and are certainly correct. Analyzing the values of F-measure, CroMatcher, Lily and YAM++ are much better than the other systems, and CroMatcher achieved slightly better results than YAM++ and Lily. Therefore, it can be concluded that the basic matchers used by CroMatcher to find correspondences based on property information exploit this information in the best way. Our previous version of the system, CroMatcher – IJMSO, achieved results similar to IAMA, ODGOMS and WikiMatch.

Considering the three most important ontology components (properties, structure and instances), the remaining test set contains ontologies in which the ontology structure is the only defined component. In Fig. 10 the results for Precision, Recall and F-measure are presented for this test set. As with the two previous test sets with one defined ontology component, all systems have a high Precision value, except CroMatcher, Lily and YAM++. IAMA, ODGOMS and WikiMatch, in particular, are again those with the best Precision. CroMatcher, Lily, YAM++ and CIDER-CL have the highest values of Recall, as well as the highest values of F-measure. Here, the final results (F-measure) of YAM++ are slightly better than the results of CroMatcher, Lily and CIDER-CL. While CIDER-CL has the best Precision among the latter three systems, it achieved very low Recall (0.06) in test 257, which reduced the impact of the high Precision value (1); therefore, its F-measure was very low (0.11) for test 257.

Figure 10: Performance comparison when matching ontologies in which the ontology structure is the only defined ontology component: Benchmark biblio tests - 257, 257-2, 257-4, 257-6, 257-8, 266

Consequently, the CIDER-CL system achieved approximately the same overall results as the systems YAM++, Lily and CroMatcher. Considering the results for the last three test sets (pairs of ontologies with only one of the main components defined), CroMatcher, Lily and YAM++ outperform the other systems. It can be concluded that our system, CroMatcher, has the best basic matchers that exploit information from entity properties, while YAM++ has the best basic matchers that exploit information from the ontology structure and entity instances. However, ontologies usually consist of several implemented components that contain information about entities. Therefore, it is very important to efficiently aggregate the results of all basic matchers that exploit the matching information from different ontology components. According to the previous results, it can be concluded that our weighted aggregation with the Autoweight++ method (presented in Section 4.2) aggregates the results of the different basic matchers very successfully. A test set containing ontologies in which all ontology components (annotations, structure, properties and instances) are partially or even completely missing (thus either partially usable or not usable at all) supports this conclusion (Fig. 11).

Figure 11: Performance comparison when matching ontologies where all ontology components are partially or even completely missing: Benchmark biblio tests - 262, 262-2, 262-4, 262-6, 262-8, 265
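As a minimal illustration of the weighted aggregation discussed above, the following sketch aggregates the similarity values produced by several basic matchers for a single pair of entities and, following the substitution rule described for Autoweight++, replaces a nonexistent value with the average of the values produced by the other matchers. The weights and similarity values below are illustrative constants; in CroMatcher the weights are computed automatically by the Autoweight++ method.

# Minimal sketch of weighted aggregation for one pair of entities. Each basic
# matcher returns a similarity in [0, 1] or None when it is inapplicable (e.g.
# the ontology component it needs is missing). A missing value is substituted
# by the average of the values produced by the other matchers for the same pair.
def aggregate(similarities, weights):
    known = [s for s in similarities if s is not None]
    if not known:
        return 0.0
    substitute = sum(known) / len(known)
    filled = [substitute if s is None else s for s in similarities]
    return sum(w * s for w, s in zip(weights, filled)) / sum(weights)

# Illustrative values: annotation-, structure- and instance-based matchers,
# where the instance-based matcher could not be applied to this pair.
similarities = [0.8, 0.6, None]
weights = [0.5, 0.3, 0.2]          # illustrative; in CroMatcher these come from Autoweight++
print(round(aggregate(similarities, weights), 2))   # 0.72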

The value of Precision is very high for all systems except YAM++, but the value of Recall compensates for this lower result of YAM++. CroMatcher, Lily, YAM++ and CIDER-CL achieved the best results for Recall, although the value itself is not very high. The lower value is a logical consequence of the fact that all of the major components in the test ontologies are (partially) missing. The results for F-measure show that CroMatcher, Lily, YAM++ and CIDER-CL have the best overall results for this test set. Our system, CroMatcher, has slightly better results than YAM++, Lily and CIDER-CL. In other words, the basic matchers of our system effectively exploited the information from the partially defined components of the compared ontologies, while the weighted aggregation with the Autoweight++ method aggregated the results of these basic matchers very successfully.

While all previous experiments referred to particular portions of the Benchmark biblio test set in order to analyze particular components of the matching systems (basic matchers, aggregation, etc.), we now analyze the performance of each matching system on the entire test set [61]. The overall results are shown in Fig. 12.

Figure 12: Performance comparison for the entire Benchmark biblio test set

The IAMA system has the highest Precision value. The systems with the highest Recall values have slightly lower Precision, but a more balanced ratio between Precision and Recall (expressed by the F-measure) than the IAMA system. The systems CroMatcher, Lily, YAM++ and CIDER-CL relaxed their strict criteria for determining correspondences in order to achieve high Recall and thus improve the overall performance of finding the correspondences between ontologies. The values of F-measure show that the best results were obtained by CroMatcher, Lily and YAM++; Lily has a slightly better Recall value, while YAM++ has a slightly better Precision value. The method for automatically computing the final alignment, which is one of the contributions of this paper and an integral part of our system CroMatcher, was not tested individually against other final alignment methods, but only within the matching results of the entire system. However, it can be concluded that this method successfully determines the final alignment, because otherwise the entire system would not achieve good results.
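The iterative final alignment method can be illustrated with the following sketch, based on its summary in the conclusion: in each iteration only correspondences that are maximal from the perspective of both entities and that exceed an imposed threshold are accepted. The similarity values, the threshold and the removal of already matched entities between iterations are illustrative assumptions, not the exact implementation used in CroMatcher.

# Minimal sketch of the iterative final alignment selection over a similarity
# matrix sim[a][b]. In each iteration a correspondence is accepted only if it is
# the maximum of its row and of its column (i.e. maximal from the perspective of
# both entities) and exceeds a threshold; the matched entities are then removed
# before the next iteration (an assumption made here for illustration).
def iterative_alignment(sim, threshold=0.5):
    alignment = {}
    rows = set(sim)
    cols = {b for a in sim for b in sim[a]}
    changed = True
    while changed:
        changed = False
        for a in list(rows):
            b = max(cols, key=lambda c: sim[a].get(c, 0.0), default=None)
            if b is None:
                continue
            value = sim[a].get(b, 0.0)
            if value >= threshold and value >= max(sim[r].get(b, 0.0) for r in rows):
                alignment[a] = (b, value)
                rows.discard(a)
                cols.discard(b)
                changed = True
    return alignment

sim = {"Book": {"Volume": 0.9, "Writer": 0.2}, "Author": {"Volume": 0.3, "Writer": 0.8}}
print(sorted(iterative_alignment(sim).items()))
# [('Author', ('Writer', 0.8)), ('Book', ('Volume', 0.9))]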

Analyzing the results of the entire test set, it can be assumed that Lily has a very efficient method for final alignment (as mentioned in Section 3, this method deals with various irregularities within the correspondences in the final alignment). Although the method has not been evaluated individually, the system achieved the best F-measure, which would not be possible without a very efficient alignment method. Considering the entire evaluation process, Lily achieved the best results for the entire Benchmark biblio test set, but YAM++ and our system, CroMatcher, are very close to Lily. YAM++ achieved better results for ontologies where one major ontology component (structure, instances or properties) contains most of the matching information. Our system, CroMatcher, achieved better results when matching information from different ontology components has to be merged together. When all ontology components except entity annotations are defined, Lily achieved the best results. Accordingly, all three systems have comparative advantages that could be exploited in order to build a new system intended to achieve better matching results than Lily, YAM++ and CroMatcher individually. Such a system should contain the basic matchers of YAM++ that exploit information from instances or the ontology structure. It should also contain the CroMatcher basic matchers that exploit information about class properties. Furthermore, according to the results obtained when matching ontologies that contain multiple implemented components, we assume that our modified weighted aggregation with Autoweight++, together with the weighted aggregation of the Lily system, is the best solution for aggregating the basic matcher results.

The results of the Benchmark biblio test set confirmed that our CroMatcher system achieves better results than the first version of our system, CroMatcher – IJMSO. An additional comparison between these systems has been made to show how each newly proposed component of our system improves the overall matching result. In Table 3, the first and the last row present the results of our system (hereinafter CroMatcher 2015) and of the first version of our system (CroMatcher – IJMSO), respectively. Each of the four middle rows presents the results of CroMatcher 2015 without one of the newly proposed components. It can be seen that the absence of any newly proposed component makes the results worse. Moreover, when all of the newly proposed components are missing (CroMatcher – IJMSO), the system achieves the worst results. Therefore, it can be concluded that the newly proposed components of our CroMatcher system significantly improve the matching results (in particular, Recall and F-measure).

Although our primary objective was to prepare our system for the Benchmark biblio test set, we will briefly discuss the results that our system achieved for the other two well-known test sets in the OAEI evaluation, Anatomy [68] and Conference [69]. In OAEI 2015, our system succeeded in finishing the matching process for the Anatomy test set for the first time (in OAEI 2013 our system did not finish in time), and it achieved the fifth best result on that test set.

Table 3: Performance comparison of different versions of our CroMatcher system on the entire Benchmark biblio test set

                                                             Precision   Recall   F-measure
CroMatcher (CM) 2015                                            0.95       0.82      0.88
CM - Autoweight++ without substituting nonexistent values      0.96       0.70      0.81
CM - Autoweight++ without MaxColumnRow threshold               0.96       0.76      0.85
CM - Final alignment without iterative process                 0.96       0.74      0.83
CM - architecture without newly proposed basic matchers        0.86       0.60      0.71
CroMatcher - IJMSO                                              0.94       0.52      0.67

Analyzing the Anatomy test set, we notice that the entities of its ontologies have several relations (oboInOwl#hasRelatedSynonym, oboInOwl#hasDefinition and oboInOwl#hasAlternativeId) that are introduced as additional OWL AnnotationProperty constructs and contain a lot of information about a particular entity. We believe that when we include these relations in the matching process (which we expect to present at OAEI 2016), the results will be even better. The annotations oboInOwl#hasRelatedSynonym and oboInOwl#hasDefinition contain a synonym and the definition of an entity, respectively. Since these annotations describe entities very accurately, exploiting that information would definitely improve the results. The annotation oboInOwl#hasAlternativeId contains an alternative ID of the entity. These IDs can improve the results when matching two ontologies using a mediator ontology such as the Uber Anatomy Ontology (Uberon) [70], which contains all IDs of the entities that describe the same individual within different anatomy ontologies.

Considering the Conference test set, our system did not achieve good matching results. However, analyzing the 21 reference alignments between the ontologies in this test set, there is an average of 14 correct correspondences between two compared ontologies. If one system determines two correct correspondences more than another system, the Recall values of these two systems will differ by approximately 0.14 (i.e. 2/14). A similar observation applies to the Precision value. Therefore, the gap our system needs to close to reach the best systems in the Conference test set amounts to only about two additional correctly found correspondences. Analyzing each individual ontology in this test set, we concluded that we should insert a basic matcher that matches entities according to the linguistic similarities (synonyms, hypernyms) between terms contained in the entity annotations. Therefore, we added a simple basic matcher that is based on WordNet [32] and determines synonyms between entity annotations. After this change, we tested our system on a local machine and obtained much better results for the Conference test set (we would have held third place in comparison with the results of the other systems that participated in OAEI 2015). We have started working on further improving this basic matcher and our matching results for the Conference test set.
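As an illustration of the idea behind such a simple WordNet-based matcher, the following sketch checks whether two annotation terms share a WordNet synset; it uses the NLTK interface to WordNet purely for demonstration and is not the code of the matcher added to CroMatcher.

# Illustration of a WordNet-based synonym check between two annotation terms,
# using the NLTK corpus reader (requires: pip install nltk and
# nltk.download('wordnet')). This only demonstrates the idea of the matcher
# described above; it is not the implementation used in CroMatcher.
from nltk.corpus import wordnet as wn

def are_synonyms(term_a, term_b):
    # two terms are treated as synonyms if any of their WordNet synsets share a lemma
    lemmas_a = {lemma.lower() for s in wn.synsets(term_a) for lemma in s.lemma_names()}
    lemmas_b = {lemma.lower() for s in wn.synsets(term_b) for lemma in s.lemma_names()}
    return term_a.lower() == term_b.lower() or bool(lemmas_a & lemmas_b)

print(are_synonyms("author", "writer"))      # True: both appear in a common synset
print(are_synonyms("conference", "paper"))   # False for typical WordNet data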

Considering the novelties in the version of CroMatcher that ran at OAEI 2015 with respect to the version that ran at OAEI 2013, we primarily focused on improving the execution time. We made some important algorithmic changes in the program code in order to speed up our system. At OAEI 2013 (tested officially on a Dell PowerEdge T610 with two Intel Xeon Quad Core 2.26 GHz E5607 processors, 32 GB of RAM, Java 1.8 and a Java heap size of 8 GB) the execution time of CroMatcher was 1114 seconds for the entire Benchmark biblio test set. No official data on execution times at OAEI 2015 were available at the moment of writing this paper. However, our local test on a Dell laptop with a single Intel i5-2450 2.50 GHz processor, 4 GB of RAM, Java 1.8 and a Java heap size of 1.5 GB resulted in an execution time of 493 seconds, which is more than two times faster, even though the latter execution was performed in a significantly inferior environment.

6. Conclusion

In this paper we presented CroMatcher, our ontology matching system, which automatically performs all phases of the ontology matching process. We proposed the Autoweight++ method, an enhanced version of our earlier Autoweight method. We also proposed a new, iterative method for producing the final alignment between the compared ontologies. Both methods are implemented within the CroMatcher ontology matching system. In our matching system we implemented nine basic matchers. Each basic matcher exploits information from certain ontology components in order to match particular ontology entities. The basic matchers are arranged into a parallel composition, where every matcher is executed independently. After that, the obtained alignments are united into a single alignment using the Autoweight++ aggregation method. In Autoweight++, our method that automatically calculates the weighting factors for the basic matchers, we proposed some major enhancements to our previous weighted aggregation method. First, we introduced a new rule for selecting the relevant correspondences that participate in the calculation of the weighting factors. Second, we solved the problem of nonexistent correspondences during the calculation of the aggregated correspondences: they are replaced with the average of the correspondences between the two entities obtained by the other basic matchers. We also introduced a new iterative final alignment calculation method. In each iteration, only correspondences that have the maximum value from the perspective of both ontology entities and that also satisfy an imposed threshold are included in the final alignment.

The evaluation of the CroMatcher ontology matching system was performed through its participation in the OAEI evaluation contest. Its performance on the Benchmark biblio test set was compared to that of the other state-of-the-art matching systems. The results show that our matching system achieves the best scores for a large number of test cases within Benchmark biblio. The score achieved by our software was particularly outstanding for the test cases where ontologies lacked one or two important components, which puts into focus
the question of aggregating the basic matchers with respect to the ontology components used for the correspondence calculation. For these test cases the presented Autoweight++ method performed particularly well, being able to estimate the quality of the results achieved by each basic matcher and to assign a higher weight to the matchers producing the results considered more important, which consequently contributes to a higher quality of the aggregated results. On the other hand, in test cases where only one of the four important components is defined, our choice of basic matchers came to the fore. Our system performed particularly well for ontologies where only entity annotations or only properties are defined. According to the achieved evaluation scores, the presented method for final alignment calculation also performs well, since the performance of our system is either the best, second-best or third-best across the different tests conducted within the Benchmark biblio test set. In conclusion, the implemented ontology matching system, which comprises the proposed Autoweight++ method and the proposed final alignment calculation method, accomplished high scores and a successful showcase performance. In future work we will focus on a detailed analysis of particular basic matchers, in order to implement enhanced versions aimed at achieving even higher test scores. Moreover, we will try to improve the matcher algorithms in order to speed up their execution.

References

[1] P. Borst, H. Akkermans, J. L. Top, Engineering ontologies, Int. J. Hum.-Comput. Stud. 46 (2) (1997) 365–406.
[2] J. Euzenat, P. Shvaiko, Ontology matching, 2nd Edition, Springer-Verlag, Heidelberg (DE), 2013.
[3] M. Mao, Y. Peng, M. Spring, A harmony based adaptive ontology mapping approach, in: H. R. Arabnia, A. Marsh (Eds.), Proc. of the 2008 Int. Conf. on Semantic Web & Web Services, SWWS 2008, July 14-17, 2008, Las Vegas, Nevada, USA, CSREA Press, 2008, pp. 336–342.
[4] M. Gulić, I. Magdalenić, B. Vrdoljak, Automatically specifying parallel composition of matchers in ontology matching process, in: E. García-Barriocanal, Z. Cebeci, M. C. Okur, A. Öztürk (Eds.), Metadata and Semantic Research - 5th Int. Conf., MTSR 2011, Izmir, Turkey, October 12-14, 2011. Proc., Vol. 240 of Communications in Computer and Information Science, Springer, 2011, pp. 22–33.
[5] M. Gulić, I. Magdalenić, B. Vrdoljak, Automated weighted aggregation in an ontology matching system, Int. J. Metadata, Semantics and Ontologies 7 (1) (2012) 55–64.
[6] D. Aumueller, H. H. Do, S. Massmann, E. Rahm, Schema and ontology matching with COMA++, in: F. Özcan (Ed.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, ACM, 2005, pp. 906–908.
[7] D. Ngo, Z. Bellahsene, YAM++ results for OAEI 2013, in: Shvaiko et al. [27], pp. 211–218.
[8] J. Gracia, K. Asooja, Monolingual and cross-lingual ontology matching with CIDER-CL: evaluation report for OAEI 2013, in: Shvaiko et al. [27], pp. 109–116.
[9] Y. Zhang, X. Wang, S. He, K. Liu, J. Zhao, X. Lv, IAMA results for OAEI 2013, in: Shvaiko et al. [27], pp. 123–130.
[10] I. Kuo, T. Wu, ODGOMS - results for OAEI 2013, in: Shvaiko et al. [27], pp. 153–160.
[11] S. Hertling, H. Paulheim, WikiMatch results for OAEI 2012, in: Shvaiko et al. [59].
[12] M. Ehrig, Y. Sure, Ontology mapping - an integrated approach, in: C. Bussler, J. Davies, D. Fensel, R. Studer (Eds.), The Semantic Web: Research and Applications, First European Semantic Web Symposium, ESWS 2004, Heraklion, Crete, Greece, May 10-12, 2004, Proc., Vol. 3053 of Lecture Notes in Computer Science, Springer, 2004, pp. 76–91.
[13] N. Jian, W. Hu, G. Cheng, Y. Qu, Falcon-AO: Aligning ontologies with Falcon, in: Proc. of K-CAP Workshop on Integrating Ontologies, Banff, Canada, October 2, 2005, 2005, pp. 85–91.
[14] Y. Qu, W. Hu, G. Cheng, Constructing virtual documents for ontology matching, in: L. Carr, D. D. Roure, A. Iyengar, C. A. Goble, M. Dahlin (Eds.), Proc. of the 15th Int. Conf. on World Wide Web, WWW 2006, Edinburgh, Scotland, UK, May 23-26, 2006, ACM, 2006, pp. 23–31.
[15] J. Tang, J. Li, B. Liang, X. Huang, Y. Li, K. Wang, Using Bayesian decision for ontology mapping, Web Semantics: Science, Services and Agents on the World Wide Web 4 (4) (2006) 243–262.
[16] P. Lambrix, H. Tan, SAMBO - a system for aligning and merging biomedical ontologies, J. Web Semant. 4 (3) (2006) 196–206.
[17] M. H. Seddiqui, M. Aono, An efficient and scalable algorithm for segmented alignment of ontologies of arbitrary size, J. Web Semant. 7 (4) (2009) 344–356.
[18] W. Wang, P. Wang, Lily results for OAEI 2015, in: Shvaiko et al. [29], pp. 162–170.
[19] J. Euzenat, C. Meilicke, H. Stuckenschmidt, P. Shvaiko, C. T. dos Santos, Ontology alignment evaluation initiative: Six years of experience, J. Data Semantics 15 (2011) 158–192.
[20] Ontology alignment evaluation initiative, http://oaei.ontologymatching.org/, accessed: 2016-03-02.
[21] G. Antoniou, F. van Harmelen, A Semantic Web Primer, MIT Press, 2004.
[22] World Wide Web Consortium, http://www.w3.org/, accessed: 2016-03-02.
[23] M. K. Smith, C. Welty, D. L. McGuinness (Eds.), OWL Web Ontology Language Guide, W3C Recommendation 10 February 2004, http://www.w3.org/TR/2004/REC-owl-guide-20040210/, accessed: 2016-03-02.
[24] D. L. McGuinness, F. van Harmelen (Eds.), OWL Web Ontology Language Overview, W3C Recommendation 10 February 2004, http://www.w3.org/TR/2004/REC-owl-features-20040210/, accessed: 2016-03-02.
[25] A. Gabillon, Q. Z. Sheng, W. Mansoor (Eds.), Web-Based Information Technologies and Distributed Systems, Atlantis Press, 2010.
[26] P. Shvaiko, J. Euzenat, Ontology matching: State of the art and future challenges, IEEE Trans. Knowl. Data Eng. 25 (1) (2013) 158–176.
[27] P. Shvaiko, J. Euzenat, K. Srinivas, M. Mao, E. Jiménez-Ruiz (Eds.), Proc. of the 8th Int. Workshop on Ontology Matching co-located with the 12th Int. Semantic Web Conf. (ISWC 2013), Sydney, Australia, October 21, 2013, Vol. 1111 of CEUR Workshop Proceedings, CEUR-WS.org, 2013.
[28] P. Shvaiko, J. Euzenat, M. Mao, E. Jiménez-Ruiz, J. Li, A. Ngonga (Eds.), Proc. of the 9th Int. Workshop on Ontology Matching collocated with the 13th Int. Semantic Web Conf. (ISWC 2014), Riva del Garda, Trentino, Italy, October 20, 2014, Vol. 1317 of CEUR Workshop Proceedings, CEUR-WS.org, 2014.
[29] P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, M. Cheatham, O. Hassanzadeh (Eds.), Proc. of the 10th Int. Workshop on Ontology Matching collocated with the 14th Int. Semantic Web Conf. (ISWC 2015), Bethlehem, PA, USA, October 12, 2015, Vol. 1545 of CEUR Workshop Proceedings, CEUR-WS.org, 2016.
[30] H. H. Do, E. Rahm, COMA - A system for flexible combination of schema matching approaches, in: VLDB 2002, Proc. of 28th Int. Conf. on Very Large Data Bases, August 20-23, 2002, Hong Kong, China, Morgan Kaufmann, 2002, pp. 610–621.
[31] V. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Soviet Physics Doklady 10 (1966) 707–710.
[32] G. A. Miller, WordNet: A lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
[33] D. Ngo, Z. Bellahsene, YAM++ results for OAEI 2012, in: Shvaiko et al. [59].
[34] D. Ngo, Z. Bellahsene, R. Coletta, YAM++ results for OAEI 2011, in: P. Shvaiko, J. Euzenat, T. Heath, C. Quix, M. Mao, I. F. Cruz (Eds.), Proc. of the 6th Int. Workshop on Ontology Matching, Bonn, Germany, October 24, 2011, Vol. 814 of CEUR Workshop Proceedings, CEUR-WS.org, 2011.
[35] D. Ngo, Z. Bellahsene, R. Coletta, A generic approach for combining linguistic and context profile metrics in ontology matching, in: R. Meersman, T. S. Dillon, P. Herrero, A. Kumar, M. Reichert, L. Qing, B. C. Ooi, E. Damiani, D. C. Schmidt, J. White, M. Hauswirth, P. Hitzler, M. K. Mohania (Eds.), On the Move to Meaningful Internet Systems: OTM 2011 - Confederated Int. Conferences: CoopIS, DOA-SVI, and ODBASE 2011, Hersonissos, Crete, Greece, October 17-21, 2011, Proc., Part II, Vol. 7045 of Lecture Notes in Computer Science, Springer, 2011, pp. 800–807.
[36] S. Melnik, H. Garcia-Molina, E. Rahm, Similarity flooding: A versatile graph matching algorithm and its application to schema matching, in: Proc. of the 18th Int. Conf. on Data Engineering (ICDE 2002), 2002, pp. 117–128.
[37] W. E. Winkler, String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage, in: Proc. of the Section on Survey Research, 1990, pp. 354–359.
[38] T. F. Smith, M. S. Waterman, Comparison of biosequences, Advances in Applied Mathematics 2 (4) (1981) 482–489.
[39] A. Monge, C. Elkan, The field matching problem: Algorithms and applications, in: Proc. of the 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 267–270.
[40] J. J. Jiang, D. W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in: Proc. of the 10th Int. Conf. on Research in Computational Linguistics, ROCLING '97, 1997.
[41] Z. Wu, M. Palmer, Verb semantics and lexical selection, in: Proc. of the 32nd Annual Meeting on Assoc. for Computational Linguistics, ACL '94, Association for Computational Linguistics, 1994, pp. 133–138.
[42] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, USA, 1986.
[43] R. A. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[44] H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly 2 (1-2) (1955) 83–97.
[45] W. W. Cohen, P. D. Ravikumar, S. E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: S. Kambhampati, C. A. Knoblock (Eds.), Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), August 9-10, 2003, Acapulco, Mexico, 2003, pp. 73–78.
[46] M. Smith, Neural Networks for Statistical Modeling, Van Nostrand Reinhold, New York, NY, USA, 1993.
[47] P. Wang, B. Xu, Y. Zhou, Extracting semantic subgraphs to capture the real meanings of ontology elements, Tsinghua Science and Technology 15 (6) (2010) 724–733.
[48] C. Faloutsos, K. S. McCurley, A. Tomkins, Fast discovery of connection subgraphs, in: W. Kim, R. Kohavi, J. Gehrke, W. DuMouchel (Eds.), Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, ACM, 2004, pp. 118–127.
[49] P. Wang, Research on the key issues in ontology mapping (in Chinese), Ph.D. thesis, Southeast University, Nanjing (2009).
[50] P. Wang, B. Xu, An effective similarity propagation method for matching ontologies without sufficient or regular linguistic information, in: A. Gómez-Pérez, Y. Yu, Y. Ding (Eds.), The Semantic Web, Fourth Asian Conference, ASWC 2009, Shanghai, China, December 6-9, 2009. Proceedings, Vol. 5926 of Lecture Notes in Computer Science, Springer, 2009, pp. 105–119.
[51] P. Wang, B. Xu, Debugging ontology mapping: A static approach, Computing and Informatics 27 (1) (2008) 21–36.
[52] L. Bergroth, H. Hakonen, T. Raita, A survey of longest common subsequence algorithms, in: Proc. of the Seventh Int. Symposium on String Processing Information Retrieval (SPIRE '00), IEEE Computer Society, Washington, DC, USA, 2000, pp. 39–48.
[53] G. Stoilos, G. Stamou, S. Kollias, A string metric for ontology alignment, in: Proc. of the 4th Int. Conf. on The Semantic Web, ISWC '05, 2005, pp. 624–637.
[54] Wikipedia, the free encyclopedia, http://en.wikipedia.org, accessed: 2016-03-02.
[55] D. Ritze, H. Paulheim, Towards an automatic parameterization of ontology matching tools based on example mappings, in: Proc. 6th Int. Workshop on Ontology Matching, Bonn, Germany, October 24, 2011, 2011.
[56] Y. Lee, M. Sayyadian, A. Doan, A. S. Rosenthal, eTuner: Tuning schema matching software using synthetic scenarios, VLDB J. 16 (1) (2007) 97–122.
[57] E. Peukert, J. Eberius, E. Rahm, A self-configuring schema matching system, in: A. Kementsietsidis, M. A. V. Salles (Eds.), Proc. IEEE 28th Int. Conf. on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, IEEE Computer Society, 2012, pp. 306–317.
[58] M. Gulić, I. Magdalenić, B. Vrdoljak, Ontology matching using TF/IDF measure with synonym recognition, in: T. Skersys, R. Butleris, R. Butkiene (Eds.), Information and Software Technologies, Vol. 403 of Communications in Computer and Information Science, Springer Berlin Heidelberg, 2013, pp. 22–33.
[59] P. Shvaiko, J. Euzenat, A. Kementsietsidis, M. Mao, N. F. Noy, H. Stuckenschmidt (Eds.), Proc. of the 7th Int. Workshop on Ontology Matching, Boston, MA, USA, November 11, 2012, Vol. 946 of CEUR Workshop Proceedings, CEUR-WS.org, 2012.
[60] J. Euzenat, An API for ontology alignment, in: S. A. McIlraith, D. Plexousakis, F. van Harmelen (Eds.), The Semantic Web - ISWC 2004: Third Int. Semantic Web Conf., Hiroshima, Japan, November 7-11, 2004. Proc., Vol. 3298 of Lecture Notes in Computer Science, Springer, 2004, pp. 698–712.
[61] J. Euzenat, M.-E. Rosoiu, C. Trojahn, Ontology matching benchmarks: generation, stability, and discriminability, J. Web Semant. 21 (2013) 30–48.
[62] M. Gulić, B. Vrdoljak, CroMatcher - results for OAEI 2013, in: Shvaiko et al. [27], pp. 117–122.
[63] M. Gulić, B. Vrdoljak, M. Banek, CroMatcher results for OAEI 2015, in: Shvaiko et al. [29], pp. 130–135.
[64] W. E. Djeddi, M. T. Khadir, XMap++: results for OAEI 2014, in: Shvaiko et al. [28], pp. 163–169.
[65] A. Khiat, M. Benaissa, AOT / AOTL results for OAEI 2014, in: Shvaiko et al. [28], pp. 113–119.
[66] S. Schwichtenberg, C. Gerth, G. Engels, RSDL workbench results for OAEI 2014, in: Shvaiko et al. [28], pp. 155–162.
[67] F. C. Schadd, N. Roos, Summary of the MaasMatch participation in the OAEI-2013 campaign, in: Shvaiko et al. [27], pp. 139–145.
[68] OAEI anatomy test set, http://oaei.ontologymatching.org/2015/anatomy/index.html, accessed: 2016-03-02.
[69] OAEI conference test set, http://oaei.ontologymatching.org/2015/conference/index.html, accessed: 2016-03-02.
[70] C. J. Mungall, C. Torniai, G. V. Gkoutos, S. E. Lewis, M. A. Haendel, Uberon, an integrative multi-species anatomy ontology, Genome Biology 13 (1) (2012) 1–20.