Overcoming semantic heterogeneity in spatial data infrastructures


Computers & Geosciences 35 (2009) 739–752



Contents lists available at ScienceDirect

Computers & Geosciences journal homepage: www.elsevier.com/locate/cageo

M. Lutz a,*, J. Sprado b, E. Klien c, C. Schubert d, I. Christ d

a European Commission, Joint Research Centre (JRC), Via E. Fermi 1, 21027 Ispra, Italy
b Center for Computing Technologies (TZI), Am Fallturm 1, 28359 Bremen, Germany
c Institute for Geoinformatics (IfGI), Weseler Straße 253, 48151 Münster, Germany
d Delphi InformationsMusterManagement (DELPHI IMM), Friedrich-Ebert-Straße 8, 14467 Potsdam, Germany

Article info

Article history: Received 20 December 2005; received in revised form 29 May 2007; accepted 21 September 2007

Keywords: Semantic heterogeneity; Interoperability; Spatial data infrastructures; Ontologies

Abstract

In current spatial data infrastructures (SDIs), it is still often difficult to effectively exchange or re-use geographic data sets. A main reason for this is semantic heterogeneity, which occurs at different levels: at the metadata, the schema and the data content level. It is the goal of the work presented in this paper to overcome the problems caused by semantic heterogeneity on all three levels. We present a method based on ontologies and logical reasoning, which enhances the discovery, retrieval, interpretation and integration of geographic data in SDIs. Its benefits and practical use are illustrated with examples from the domains of geology and hydrology.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Spatial data infrastructures (SDIs) play a major role in searching, accessing and integrating heterogeneous geographic data sets and geographic information (GI) services. The standards of the Open Geospatial Consortium (OGC) provide a syntactical basis for data interchange between different user communities. But this is only the first step, as semantic heterogeneity (Bishr, 1998) still presents an obstacle on the way towards full interoperability (Egenhofer, 2002; Sheth, 1999; Sondheim et al., 1999). In contrast to syntax, which only defines the structure, semantics refers to the meaning of elements. In SDIs, existing standards fail to address semantic problems that occur due to heterogeneous data content and heterogeneous user communities (using different languages, terminologies and perspectives). Semantic heterogeneity occurs at different levels. At each of these levels, it can inhibit tasks that are essential to the success of SDIs.

• At the metadata level, semantic heterogeneity impedes the discovery of geographic information;
• at the schema level, semantic heterogeneity impedes the retrieval of geographic information; and
• at the data content level, semantic heterogeneity impedes the interpretation, integration and exchange of geographic information.

It is the goal of the work presented in this paper to enhance SDIs by overcoming these problems. We present an ontology-based method for enhancing GI discovery, retrieval, interpretation and integration in SDIs, which has been developed in the meanInGs project.¹ To illustrate its benefits and practical use, we introduce two examples:

• an example from the geology domain that illustrates the benefits for interpretation and integration of GI described in different geological classification systems, and
• an example from the hydrology domain that demonstrates the benefits for GI discovery, retrieval and exchange in a service composition application.

* Corresponding author. Tel.: +39 0332 786759; fax: +39 0332 786325.
E-mail addresses: [email protected] (M. Lutz), [email protected] (J. Sprado), [email protected] (E. Klien), [email protected] (C. Schubert), [email protected] (I. Christ).
0098-3004/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.cageo.2007.09.017
¹ See http://www.meanings.de/.

Table 1
Examples of different nomenclatures used in geological maps of the Lower Buntsandstein in Saxony-Anhalt

Author(s) of classification | Fulda and Huelsemann | Dockter and Puff | Jung      | Radzinski (a)
Date                        | 1930                 | 1959             | 1968      | 1997
Survey map                  | Eisleben             | Erdeborn         | Hettstedt |

Rock description                      | Stratigraphic short terms
                                      | Fulda and Huelsemann | Dockter and Puff | Jung | Radzinski (a)
Dolomitic sandstone                   | su2o                 | su3d             | su5  | suBDS
Interlaminated mixed layers           | su2o                 | su3’st           | su5  | suBOW
Rogenstein-Zone (Oolithic limestone)  | su2u                 | su3’k            | su4  | suBRG
Red-brown schistous clay, mudstone    | su2u                 | su2              | su3  | suCST
Fine-grained carbonate sandstone      | su2u                 | su2              | su2  | suCUW
Bröckelschiefer (crumbly shales)      | su1                  | su1              | su1  | zB

(a) Official classification currently used in the geological information system of Saxony-Anhalt.

The remainder of the paper is structured as follows. Section 2 elaborates on the problems caused by semantic heterogeneity at the metadata, schema and data content levels. In Section 3, we explain the building blocks employed in the proposed approach for overcoming these problems. The method for dealing with semantic heterogeneity at the data level is described in Section 4. Section 5 shows how the building blocks can be used for overcoming semantic heterogeneity at the metadata and schema levels. In both Sections 4 and 5, we show how the presented method can be encapsulated in services and client applications and how to combine these with existing SDI components. The practical use is illustrated within the scope of the geology and hydrology examples. In Section 6, we discuss the presented approach in the context of related work. Section 7 concludes with an outlook to future work.

2. Problems caused by semantic heterogeneity

In this section, we illustrate the problems caused by semantic heterogeneity in two different geospatial applications. We also use these examples throughout the paper to illustrate our proposed solution.

2.1. Interpretation and integration

In our first scenario, Hannah, a geologist, has to answer questions concerning the stratigraphy within Saxony-Anhalt. Stratigraphy describes the layering and the corresponding age of the rocks. Hannah’s questions might include the following: “Where are the geological conditions suitable for hosting bodies of groundwater?” or “Where is the geological rock suitable for a dump site?” For this task, Hannah has to analyse and visualise several geological data sets that are available at the Geological Survey of Saxony-Anhalt.

The main challenge for answering her question lies in the interpretation of the data, i.e. at the data content level. Different authors of geological maps have used different stratigraphic classifications at different times in history, leading to several synonymous and homonymous stratigraphic terms within the geological database. Often, even on adjacent maps, different classification systems and nomenclatures are used. Table 1 gives a few examples for the geological period of the Lower Buntsandstein (Triassic, about 250 million years ago). It illustrates the use of different terms (synonyms) for the same rock formation, e.g. su2u, su3’k and su4 for Oolithic limestone, as well as the use of the same term (a homonym) for different rock formations, e.g. su2 for fine-grained carbonate sandstone as well as red-brown schistous clay.

In a current SDI, Hannah can use a Web Map Service (WMS, de la Beaujardiere, 2006) to represent or highlight data based on different classification systems in a common map. In order to provide an integrated view using a common classification system and symbology, Hannah has to formulate a Styled Layer Descriptor (SLD, Lalonde, 2002). For this task, she has to understand each of the specific classification systems used for the data. Only if Hannah can interpret and compare the terms will she be able to manually translate (or re-interpret) the data in order to formulate the SLD. The metadata for data sources available in SDIs today often do not provide sufficient information on the classification systems used, which makes this task very difficult.

2.2. Discovery, retrieval and exchange in service composition

In our second scenario, Max, a service developer, wants to implement a web service chain (as defined in ISO 19119:2005²) that provides fast and up-to-date access to water level measurements in a river, interpolates these measurements along the river course, and visualises the interpolation results. Such a service chain could, for example, enable the detection of hazard areas during flood events.

In order to execute the interpolation service in an open and distributed environment, the following steps are necessary: (1) appropriate input data have to be discovered, (2) the input data have to be retrieved using a given query filter and (3) the retrieved data have to be transformed to fit the requirements of the interpolation service.

² Available at http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=39890. ISO 19119:2005 is based on Percivall (2002), which is publicly available.

[Figure: the filter expression for WFS 1 compares the property STAV with the value 200; the filter expression for WFS 2 compares the property HEIGHT with the value 2.]

Fig. 1. Two filter expressions to retrieve all measurements with a water level greater than 2 m.
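The two requests of Fig. 1 can be approximated in a few lines of Python. This is only a sketch under stated assumptions: it uses the comparison elements of the OGC Filter Encoding (`Filter`, `PropertyIsGreaterThan`, `PropertyName`, `Literal`), and the table `SERVICE_SCHEMAS` and the function `water_level_filter` are hypothetical names introduced here. The property names and literals (STAV with 200, HEIGHT with 2) are taken from the figure; that WFS 1 reports centimetres and WFS 2 metres is inferred from those values.

```python
import xml.etree.ElementTree as ET

OGC_NS = "http://www.opengis.net/ogc"  # namespace of OGC Filter Encoding 1.x
ET.register_namespace("ogc", OGC_NS)

# Assumed, illustrative per-service knowledge: which property holds the water
# level and how many of its units make up one metre.
SERVICE_SCHEMAS = {
    "WFS 1": {"property": "STAV", "units_per_metre": 100},   # centimetres
    "WFS 2": {"property": "HEIGHT", "units_per_metre": 1},   # metres
}


def water_level_filter(service: str, threshold_m: float) -> str:
    """Build a 'water level greater than threshold' filter for one service."""
    schema = SERVICE_SCHEMAS[service]
    # Express the threshold in the unit used by the service.
    literal = threshold_m * schema["units_per_metre"]
    flt = ET.Element(f"{{{OGC_NS}}}Filter")
    comparison = ET.SubElement(flt, f"{{{OGC_NS}}}PropertyIsGreaterThan")
    ET.SubElement(comparison, f"{{{OGC_NS}}}PropertyName").text = schema["property"]
    ET.SubElement(comparison, f"{{{OGC_NS}}}Literal").text = format(literal, "g")
    return ET.tostring(flt, encoding="unicode")


if __name__ == "__main__":
    for name in SERVICE_SCHEMAS:
        print(name, water_level_filter(name, 2.0))
```

The point of the sketch is that the caller states the request once, in metres, while the schema-specific knowledge lives in a separate table. In the approach of this paper that knowledge would come from semantic annotations rather than from a hand-written dictionary.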

In current SDI architectures, Max will face problems during the discovery, retrieval and exchange of geospatial data, i.e. on the metadata, the schema and the data content level. For discovery, he will use a catalogue (Nebert et al., 2007) to do a keyword-based search, possibly in combination with a spatial filter. Even though natural language-processing techniques can increase the semantic relevance of search results with respect to the search request (e.g. Richardson and Smeaton, 1995), keyword-based techniques are inherently restricted by the ambiguities of natural language. If Max’s terminology differs from the terminology used by data providers, keyword-based search can have low recall, i.e. not all relevant information sources are discovered. Moreover, precision can also be low, i.e. some of the discovered services are not relevant (Bernstein and Klein, 2002). This can be the case if requesters and providers use homonymous terms or because the catalogue does not allow the requester to express complex queries.

Once Max has discovered an appropriate data source, he can access it through a standardized interface like a Web Feature Service (WFS, Vretanos, 2005), which supports retrieval of features encoded using the Geography Markup Language (GML, Portele, 2007). To formulate a retrieval request to the WFS (including filter conditions), Max has to know and understand the schema of the data source. While the service can return the structure of the schema, the meaning of some of the property names might not be intuitively interpretable for Max. For example, if he wants to retrieve all measurements with a water level greater than 2 m, he has to know the property containing the water level (which might e.g. be called height, level or stav³) and the unit of measure the data is given in (which might e.g. be centimetre, metre or feet). Depending on the feature type schema, the same request would have to be stated quite differently for different WFS, as shown with the two possible filter statements in Fig. 1.

Furthermore, when the retrieved data are to be consumed by another service (e.g. in a composite service chain) they might have to be mapped from the providing service’s (source) schema into the consuming service’s (target) schema. In Max’s service chain, the results returned by the WFS may be incorrectly interpreted by the consuming interpolation service. If, e.g., the interpolation service expects water level measurements in metres and the WFS provides water level measurements in centimetres, this will lead to wrong interpolation results. Therefore, both data processing and data integration within a composite service chain require the detection and elimination of heterogeneity at the data content level, e.g. by transforming values between different units of measure.

³ “stav vody” is the Czech term for “water level”.

3. Building blocks for overcoming semantic heterogeneity

In this section, we introduce the building blocks needed in our approach to overcome semantic heterogeneity in geospatial applications. We describe the ontology architecture (Section 3.1), ontology language (Section 3.2) and reasoning procedures (Section 3.3) that are employed in the proposed method. The notion of registration mappings (Section 3.4) is used to establish a link between a data schema and its semantic description, which is crucial for the tasks of data retrieval and schema transformation. The rule-based method for semantic data integration (Section 3.5) is employed in our approach for detecting and eliminating semantic heterogeneity for the task of data exchange.

3.1. Ontology architecture

Ontologies can be employed for making the semantics of the information content of geospatial web services explicit. They are constituted by a specific vocabulary used to describe a certain reality, plus a set of explicit assumptions regarding the intended meaning of the vocabulary words (Guarino, 1998). The backbone of our method is an infrastructure based on a hybrid ontology approach (Wache et al., 1999), which is a combination of two existing ontology approaches. The main idea is to describe each information system with its own application ontology, as is also done in multiple ontology approaches (Mena et al., 2000). However, in contrast to these, in the hybrid ontology approach, the concepts of each application ontology do not stand on their own, but are instead based on primitive concepts from a common shared vocabulary of the domain.⁴ Thus, comparability between the different application ontologies is achieved on the semantic level.
A user searching for data or a category with certain properties can also use the concepts and relations from the shared vocabulary to specify a query. As both application ontologies and queries are based on the same concepts, they become comparable, and thus the commitment of providers and requesters to a common shared vocabulary ensures semantic interoperability. Thus, our hybrid ontology approach offers the comparability of single ontology approaches (Arens et al., 1996) as well as the flexibility of multiple ontology approaches (Fig. 2).

⁴ The term shared vocabulary should not be confused with thesauri or lexical structures that offer simple term collections. In our approach, we use the term to comprise the collection of domain ontologies used in an information community (Fig. 2).

[Figure: within an information community, the shared vocabulary (a collection of domain ontologies) provides basic concepts and relations for specifying application ontologies and queries. The application ontologies are used for semantic annotations of data sources and classification systems. A user specifies a query.]

Fig. 2. Hybrid ontology approach, figure adapted from Wache et al. (2001).

3.2. Description logics

The ontologies shown in this paper are expressed using a Description Logic (DL) (Baader and Nutt, 2003) notation used in the RACER system (Haarslev and Möller, 2004). DL is a family of knowledge representation languages that are subsets of first-order logic (for a mapping from DL to FOL, see e.g. Sattler et al., 2003). DL provides the basis for the Web Ontology Language (OWL), the proposed standard language for the Semantic Web (Antoniou and Van Harmelen, 2003).

The basic syntactic building blocks of a DL are atomic concepts (unary predicates), atomic roles (binary predicates) and individuals (constants). The expressive power of DL languages is restricted to a small set of constructors for building complex concepts and roles. Implicit knowledge about concepts and individuals can be inferred automatically using inference procedures (Baader and Nutt, 2003). A DL knowledge base consists of a TBox containing intensional knowledge (declarations that describe general properties of concepts) and an ABox containing extensional knowledge that is specific to the individuals of the universe of discourse. In our work, we only use TBox language features, namely

• concept definition: (define-concept C D),
• concept inclusion: (implies C D), and
• role definition: (define-primitive-role R :parent P :domain C :range D).

The domain of a role is a concept describing the set of all individuals from which this role can originate. This notion of the term should not be confused with the notion “domain of interest” (as in shared domain vocabularies). The range of a role is a concept describing the set of all things the role can lead to. Concepts can be defined using the following constructors:

D →  *top*                             (universal concept)
     *bottom*                          (bottom concept)
     (and E F)                         (intersection)
     (or E F)                          (union)
     (all R C)                         (value restriction)
     (some R C)                        (existential quantification)
     (at-least|at-most|exactly n R)    (number restrictions)

The universal concept describes the set of all individuals in the universe of discourse. The bottom concept describes the empty set.

3.3. Subsumption reasoning

Determining whether one description subsumes another one, i.e. whether the first is more general than the second, is one important reasoning task of DL systems. Formally, subsumption can be defined as follows: In a terminology T containing concepts C and D, C is subsumed by D if in every model of T the set denoted by C is a subset of the set denoted by D (Donini, 2003). With subsumption tests, the concepts of a terminology can be organised into a hierarchy according to their generality.

A concept description can also be conceived as a query, describing a set of objects one is interested in (Donini, 2003). Thus, all concepts that are subsumed by the query concept can be considered to also satisfy the query. Users can apply this functionality for matchmaking, i.e. for discovering concepts that match their query. The query concept used in matchmaking can either be an existing concept from a domain or application ontology (simple query) or a concept defined by the user based on the concepts and relations in the shared vocabulary (defined concept query). For example, the concept su4 from an existing application ontology (representing a stratigraphic term in some geological classification system) could be used as a query concept in a simple query. In contrast, to express a query such as “areas suitable for hosting bodies of groundwater”, a query concept might have to be defined (if no such concept already exists in an application ontology). This new query concept (for a defined concept query) should be based on domain concepts such as consistency or layering.

3.4. Registration mappings for GI retrieval

While subsumption reasoning allows expressive query processing that helps to improve data discovery, supporting users in formulating filters for data retrieval requires an explicit link between the data source’s schema and its application ontology. This link will ensure that the schema elements can be interpreted in terms of the shared vocabulary (see Section 3.1). In order to establish this link, we use registration mappings as introduced in Bowers et al. (2004). The main idea of registration mappings is to have separate descriptions of the application concept C and of the structural details of the feature type it describes. This has the advantage that the semantics of the feature type can be specified more accurately in application concepts because the specification does not try to mirror the feature type’s structure. This is especially true for feature types that do not reflect the conceptual model of the domain well.

An example registration mapping⁵ for a feature type representing water level measurements in rivers is shown in Fig. 3. The feature type is mapped to an application concept (wfs1_Measurement) and its properties are mapped to a contextual path in the (application) ontology. A contextual path denotes a concept, possibly within the context of other concepts. It takes the form C.r1.r2. ... .rn for n ≥ 0, where C is a DL concept and r1 to rn are DL properties. For example, the contextual path wfs1_Measurement.quantityResult.observedWaterBody refers to the water body in which a (wfs1) measurement was taken. Note that, when using registration mappings, we always assume that every property of a feature type can be mapped to a contextual path in the ontology.

GML document (element values): 859015.7375721685,5624676.8195826; LABE; 151; 2003-11-02T07:00:00

Structural Path                    ↔  Conceptual Path
/StavVody                          ↔  wfs1_Measurement
/StavVody/gml:position/gml:Point   ↔  wfs1_Measurement.location
/StavVody/tok                      ↔  wfs1_Measurement.quantityResult.observedWaterBody.name
/StavVody/stav                     ↔  wfs1_Measurement.chmi_qRWaterLevel.value
/StavVody/datum                    ↔  wfs1_Measurement.timeStamp

Fig. 3. Example registration mapping for the GML document shown on top.

⁵ Taken from the hydrology example described in Section 4.

3.5. Semantic mediation for GI service composition

The matchmaking approach described in Section 3.3 can be used to identify data sources that exactly match the semantics required by the requester. By introducing a mechanism for data integration between the source and the target schema, more relaxed types of match also become possible.

As we want to transform data on-the-fly, we use a non-materialized data integration approach using a mediator architecture (Wiederhold, 1992). In detail, our approach is based on the idea of semantic mediation introduced in Wache (2003). When compared with other approaches from the area of schema integration (for an overview, see Conrad, 2002; Rahm and Bernstein, 2001), semantic mediation specifically focuses on semantic heterogeneity (Goh et al., 1999).

When developing a mediator for integrating heterogeneous information sources, the specification of the integration mappings on the semantic level is considered to be the most crucial task. As described in Section 3.1, we use application ontologies to semantically describe an information source. Following Wache (2003), we further define a semantic description to consist of two parts: a description of the meaning and a set of context attributes. Analogously to Wittgenstein (1953), Wache claims that meaning is determined by use in the language and a context of a given situation, respectively. Thus, Wache defines the meaning on a meta-level uniformly across all data sources in a domain, e.g. by a (meta-)concept WaterLevel. It can be used to automatically detect semantically equivalent information or semantic heterogeneity within one domain. The context attributes describe the different encodings of semantically equivalent information in different data sources, e.g. the unit or the scale of WaterLevel. Thus, we are able to solve semantic heterogeneity problems between specific data sources within one domain and also simplify the process of identifying semantically equivalent information while considering different representations of data.

The integration mapping, which provides the basis for the data exchange task, is specified using a rule-based approach. The mapping consists of context transformation rules that specify how a piece of information can be transformed from one context into another. Table 2 shows schematically a simple example of a context transformation rule. The rule specifies how to transform a measurement (i.e. a piece of information) in centimetres (source context) into a measurement in metres (target context). More precisely, it states that when moving from the context unitOfMeasure = Centimeter to the context unitOfMeasure = Meter, the value ?v2 of the attribute ?att2 can be derived by dividing the value ?v1 of attribute ?att1 by 100, provided the meaning ?m of both attributes is the same.

Table 2
Context transformation rule

                | Source information          | Target information     | Transformation operation
Attribute name  | ?att1                       | ?att2                  |
Meaning         | ?m                          | ?m                     |
Context         | UnitOfMeasure = Centimeter  | UnitOfMeasure = Meter  |
Value           | ?v1                         | ?v2                    | ?v2 = ?v1/100

Variables are denoted by a prefixed question mark to distinguish them from constants.

4. Enhancing GI interpretation and integration

Often, values in geographic datasets are encoded using some kind of classification system. Examples for such values include soil types, land cover types and geological ages or formations. A class or type in such a classification system can usually be described by a number of defining characteristics. We aim at overcoming some of the semantic heterogeneity on the data level by making the meaning of terms from different classification systems explicit. This should ease the interpretation of data values from unknown data sources as well as the integration of several data sources that use different classification systems.

We use the building blocks presented in Section 3 to make the meaning of terms used in classification systems explicit and comparable. More precisely, we use DL concepts to represent the classification system terms and subsumption reasoning to establish whether a term in one classification is a specialisation of a term in another classification or to find terms in different classification systems that are all specialisations of a user query. In Section 4.1, we illustrate this method using the geology example introduced in Section 2. In Section 4.2, we present an SDI implementation for this example.

4.1. Enhancing GI interpretation and integration in the geology example

In our scenario, Hannah has to analyse data sets based on different stratigraphic classification systems that are unknown to her.
Also, she needs to find areas with the same petrographic characteristics in several adjacent geological maps that use different classification systems.

The basis for enhancing GI interpretation and integration in this scenario is a common shared vocabulary for a sub-domain of geology. The domain ontology describes the petrographic properties of the stratigraphic layers, e.g. the consistency of the rock, its major and minor petrographic components, and their grain size⁶ and layering. Fig. 4 gives a simplified overview of the concepts and properties in the domain ontology.

⁶ We use the concepts SandSize, SiltSize, etc. to refer to grain size classes (e.g. 0.063–2 mm for SandSize) that characterize material of that grain size (i.e. sand, silt, etc.).

[Figure labels: concepts ClasticSediment, Consistency (Solid, Granular), Genesis (Aeolic, Glacial, ...), Component, Carbonate, Silicate, Humus, Ooide, Layering (Banks), GrainSize (GravelSize, SandSize, SiltSize, ClaySize); properties hasConsistency, hasGenesis, hasMajorComponent, hasMinorComponent, hasGrainSize, isLayered; cardinalities 1..3 and 0..*. The legend distinguishes concept, property, generalisation and cardinality.]

Fig. 4. Concepts and relations of the domain ontology. For simplicity, only some of the concepts and relations are shown.

[Figure: the Client Workflow Service (WS Client) is used to (1) define the query concept; the Ontology-based Reasoner (OBR) is used to (2) find matching concepts based on subsumption reasoning; the WS Client then (3) creates the SLD and (4) requests a map (based on the SLD) from the Web Map Service (WMS), which in turn (5) requests features from the Web Feature Service (WFS).]

Fig. 5. SDI architecture for GI interpretation and integration.

Based on the domain ontology, more specific application ontologies are defined for each classification system. For example, the stratigraphic terms su1 and su3 (from the Classification of Jung, cf. Table 1) both describe rock types with hard consistency and silt as their main component; su1 further contains sand and lime (with a particular layering) as other components, while su3 does not have any other components. The application concepts for these descriptions, i.e. their formalisations based on the domain vocabulary, are given below.

    (define-concept su1
      (and ClasticSediment
           (all hasConsistency Solid)
           (exactly 1 hasMajorComponent)
           (all hasMajorComponent (all hasGrainSize SiltSize))
           (exactly 2 hasMinorComponent)
           (exactly 1 hasMinorComponent (all hasGrainSize SandSize))
           (exactly 1 hasMinorComponent
                    (and Lime (exactly 1 isLayered) (all isLayered Banks)))))

    (define-concept su3
      (and ClasticSediment
           (all hasConsistency Solid)
           (exactly 1 hasMajorComponent)
           (all hasMajorComponent (all hasGrainSize SiltSize))
           (exactly 0 hasMinorComponent)))

As all descriptions from different classification systems are grounded in the common shared vocabulary, they become comparable, and Hannah can use subsumption reasoning to find concepts matching a specific query concept. She can do her search either as a simple query or as a defined concept query (Section 3.3). For example, if she wants to highlight areas in a map that are classified as su4 (according to the Classification of Jung), she can do a simple query for this concept. If she is interested in “areas suitable for dump sites”, she can define a new query concept that defines the characteristics of a rock offering a good protection against ground water pollution. Translated into petrographic characteristics, this means it must be a solid rock or a soil based on clay or silt. This can be translated into the following DL query concept:

    (define-concept query
      (and ClasticSediment
           (or (and (all hasConsistency Solid)
                    (exactly 1 hasMajorComponent))
               (and (all hasConsistency Granular)
                    (exactly 1 hasMajorComponent)
                    (all hasMajorComponent
                         (all hasGrainSize (or ClaySize SiltSize)))))))

This query concept can be used in a DL reasoner to find more specific terms in each of the application ontologies that describe stratigraphic terms in the different classification systems. In our scenario, subsumption reasoning returns (among others) the concepts su1 and su3 from the application ontology representing the classification system by Jung (see definitions above). Hannah can use these terms to create the SLD for selection and visualisation of the data sets described using this classification system. Of course, she can also repeat this procedure for further application ontologies associated with other classification systems.
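To make the outcome of this matchmaking step tangible, the sketch below evaluates the conditions of the query concept against flattened attribute records for su1 and su3. It is explicitly not DL subsumption reasoning: a reasoner such as RACER compares concept definitions, whereas this code only checks the query's conditions on hand-written records. The class `TermRecord`, the function names and the `looseSand` record are hypothetical and exist only for illustration; the su1 and su3 attributes are read off the definitions above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TermRecord:
    """Flattened view of a stratigraphic term (all assumed ClasticSediment)."""
    name: str
    consistency: str         # "Solid" or "Granular"
    major_components: int    # number of major components
    major_grain_size: str    # grain size class of the major component


TERMS = [
    TermRecord("su1", "Solid", 1, "SiltSize"),
    TermRecord("su3", "Solid", 1, "SiltSize"),
    # Hypothetical counter-example, not taken from any classification system.
    TermRecord("looseSand", "Granular", 1, "SandSize"),
]


def satisfies_dump_site_query(term: TermRecord) -> bool:
    """Mirror the two branches of the 'or' in the query concept."""
    solid_rock = term.consistency == "Solid" and term.major_components == 1
    fine_soil = (
        term.consistency == "Granular"
        and term.major_components == 1
        and term.major_grain_size in {"ClaySize", "SiltSize"}
    )
    return solid_rock or fine_soil


def matching_terms(terms: List[TermRecord]) -> List[str]:
    """Return the names of all terms that satisfy the query conditions."""
    return [t.name for t in terms if satisfies_dump_site_query(t)]


if __name__ == "__main__":
    print(matching_terms(TERMS))
```

Under these assumptions the sketch selects su1 and su3, in line with the result reported above, and rejects the granular sand record because its grain size is neither clay nor silt.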


4.2. SDI implementation In order to use the methods for enhancing GI interpretation and integration in SDIs, they have to be encapsulated in software components. In this section, we introduce an architecture that includes such components in addition to existing SDI components and describe the flow of information between them. Fig. 5 depicts the architecture including the tasks fulfilled by each of the components. The central component of the architecture is a client that manages the overall workflow (WS Client). In our current architecture, this service is tailored to its specific application, i.e. to highlight areas in adjacent geological maps given a specific user query. In a future version of the architecture, this client could be substituted by a generic workflow service that executes a customised description of the service chain to be executed, e.g. using the Business Process Execution Language (BPEL) (Andrews et al., 2003). The user interface provided by the WS Client is the entry point of the application (step 1 in Fig. 5). The user can either choose a query concept from an existing application ontology or define one’s own query concept based on the domain ontology. In order to make the use of domain ontologies transparent for the user, we have implemented a function for translating user queries into DL query concepts, which can subsequently be used by an inference engine to do the actual matchmaking. The query concept is sent to the Ontology-Based Reasoner (OBR), a component that stores the registered ontologies and provides the reasoning functionality (step 2). Based on subsumption reasoning the OBR discovers semantically matching terms in the different classification systems and returns them to the WS Client. In the next step, these concepts are used to generate an SLD including a GetFeature request to the WFS that provides a standardized access to the geological database (step 3). 
The SLD is included in a request to a WMS (step 4), and the retrieved features (step 5) are displayed in a map. As a result of the user-defined query, Fig. 6 shows the successful data interpretation and integration in terms of a combined map from two neighbouring map sheets in Saxony-Anhalt. Each map sheet is based on a different classification system.

5. Enhancing GI discovery, retrieval and exchange

Before data sources can be interpreted and/or integrated as described in Section 4, they need to be discovered and retrieved. In a service chain, data also have to be exchanged between connected component services. These tasks are often hindered by semantic heterogeneity on the metadata and schema levels (Section 2). On these levels, too, the building blocks introduced in Section 3 can be used for enhancing the discovery, retrieval and exchange of geographic information.

In Lutz and Klien (2006), we have proposed an integrated approach for GI discovery and retrieval based on a specific user query. In this paper, we adapt and extend this approach to situations where the retrieved data are to be consumed by another service (e.g. in a composite service chain). In these cases, the user query is replaced by the requirements of the consuming service. Furthermore, an additional step might be required after discovering and retrieving appropriate data. If the structure and semantics of the data provided by one service do not match exactly those required by the consuming service, a transformation becomes necessary. In general, transformations connect one or more data sources to a destination with the help of appropriate conversion rules. The main challenge here is not to process the transformation, but rather to discover and to specify it.

A prerequisite for the presented approach is the availability of shared vocabularies for the domain of interest. The available information sources have to be semantically annotated with application ontologies (Section 3.1) that use registration mappings (Section 3.4), and the input of the consuming service also needs to be described through a DL concept and a registration mapping.
The DL concept describing the service input is used as a query concept for discovering semantically appropriate feature types. The query concept might already take into account that a transformation is possible for certain characteristics and hence relax some of the restrictions used in the approach presented in Lutz and Klien (2006). The matchmaking between the query concept and the application concepts describing feature types is based on subsumption reasoning (Section 3.3).

After the requester has selected one of the discovered feature types, a query that uses the property names of the feature type’s application schema is constructed from the user query. This step requires registration mappings (Section 3.4) between the feature type’s properties and the roles from the domain ontology. Step-by-step instructions for deriving the request are given in Lutz and Klien (2006). The derived query is then executed and its results are used as input for the consuming service.

In most cases, the retrieved data will not fit the requirements of the consuming service directly, and a transformation is necessary. For that purpose, we use the approach of semantic mediation introduced in Section 3.5. In detail, we analyse the correspondences between the properties of the feature types discovered in the previous step. These correspondences are established by referring to the properties’ contextual characteristics. After detecting contextual heterogeneity, at least one context transformation rule has to be acquired to resolve it. Ideally, context transformation rules given by a function library will be used.

In Section 5.1, we illustrate this approach using the hydrology example introduced in Section 2. In Section 5.2, we present an SDI implementation for this example.

Fig. 6. Result for rock types offering a good protection against ground water pollution. Section of topographic survey map GK 25: 4335 Hettstedt (date: 1962; author: JUNG) and section of topographic survey map GK 25: 4435 Eisleben (date: 1929; authors: FULDA & HUELSEMANN).

5.1. Enhancing GI discovery, retrieval and exchange in the hydrology example

In our example, Max is interested in water level measurements for the Elbe River for a given date (22 April 2004) and location, which he wants to use as input for an interpolation service.
In the following, we illustrate how to build a query statement based on Max’s requirements (1), how to turn this query into a DL query concept (2) and subsequently into a GetFeature request (3), and finally how to derive the transformation rules required in order to use the data as input for the interpolation service (4).

(1) Fig. 7 presents a query statement reflecting Max’s requirements. The query follows the syntax for an SQL-like query language proposed in Lutz and Klien (2006), which is intended to enable users to intuitively select properties of specific feature types, possibly using one or several constraints. Properties correspond to relations in the shared vocabulary, while feature types correspond to concepts. The constraints are either type restrictions (e.g. observable hasType WaterLevel) or value constraints, i.e. comparisons with a value specified by the requester (e.g. dateStamp = 2004-04-22). The query statement also includes a requirement of the interpolation service, which expects values to be given in metres (unitOfMeasure hasType Meter).

    SELECT quantityResult
    FROM Measurement
    WHERE quantityResult hasType (
            observable hasType WaterLevel AND
            unitOfMeasure hasType Meter AND
            observedWaterBody hasType (name = “Elbe”))
      AND dateStamp = 2004-04-22
      AND hasLocation isWithinBoundingBox (11,52,13,54)

Fig. 7. Example for a semantic query statement. Keywords of proposed syntax are shown in capitals, comparators in italics.

(2) This query can be translated following the guidelines in Lutz and Klien (2006) into the DL query concept shown below. Type constraints are expressed through universally quantified value restrictions in DL. Value constraints only become relevant when the data are retrieved from the WFS. In order to be able to express these constraints as a filter expression in the GetFeature query, it is important that the feature type contains the property to be constrained. Therefore, in the discovery phase, value constraints are expressed as existential quantification on the specified roles.

In this step, possible transformations could already be taken into account. If, for example, a transformation rule between the units metre and centimetre exists, it makes sense to search for feature types that offer water level measurements in either unit, thus extending the potential result set. In order to relax the query concept accordingly, the range restriction for the unitOfMeasure role is relaxed from Meter to the disjunction of unit concepts that are transformable into centimetres: (or Centimeter Meter Inch …). Note that now it is no longer guaranteed that the discovered feature types exactly match the requirements of the interpolation service. However, it is guaranteed that all discovered feature types can be transformed into the correct schema using a transformation rule.
    (define-concept query
      (and Measurement
        (some quantityResult
          (all observable WaterLevel)
          (all unitOfMeasure (or Centimeter Meter Inch …))
          (some observedWaterBody (some name *top*)))
        (some dateStamp *top*)
        (some hasLocation *top*)))

Type constraints: the universally quantified (all) restrictions. Value constraints: the existential (some … *top*) restrictions.
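The two translation rules described in step (2), together with the relaxation of the unit restriction, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the rule library RULES and its conversion factors are hypothetical, the target unit is taken to be the metre expected by the interpolation service, and the handling of nested restrictions described in Lutz and Klien (2006) is not reproduced.

```python
# Hypothetical function library of context transformation rules,
# keyed by (source unit concept, target unit concept).
RULES = {
    ("Centimeter", "Meter"): lambda value: value / 100.0,
    ("Inch", "Meter"): lambda value: value * 0.0254,
}


def transformable_units(target, rules):
    """Unit concepts that can be turned into `target`, including `target` itself."""
    return sorted({target} | {source for (source, dest) in rules if dest == target})


def type_restriction(role, concepts):
    """Type constraint: universal restriction; several concepts become a disjunction."""
    filler = concepts[0] if len(concepts) == 1 else "(or " + " ".join(concepts) + ")"
    return f"(all {role} {filler})"


def value_restriction(role):
    """Value constraint: during discovery the role only has to exist."""
    return f"(some {role} *top*)"


# Relax the unit restriction to every unit for which a rule into metres exists.
units = transformable_units("Meter", RULES)
print(type_restriction("observable", ["WaterLevel"]))
print(type_restriction("unitOfMeasure", units))
print(value_restriction("dateStamp"))

# Once data have been retrieved, the same library supplies the conversion itself.
print(RULES[("Centimeter", "Meter")](250.0))
```

Deriving the disjunction from the rule library, rather than listing the units by hand, keeps the relaxed query concept consistent with the transformations that can actually be carried out afterwards.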

This query concept is then used for discovering appropriate data sources based on DL subsumption reasoning (Section 3.3). (3) In the next step Max wants to retrieve the requested information from the discovered data sources. As we assume an SDI setting, this means formulating a GetFeature request including a filter expression to the WFS serving the data. In order to do this, the structure of the WFS’s feature type and the names of its attributes have to be known. All the required information can be accessed from the feature type’s


stav StavVody/tok Elbe StavVody/datum 2004-04-22 gml:position