Environmental Modelling & Software 24 (2009) 1153–1162
Contents lists available at ScienceDirect
Environmental Modelling & Software journal homepage: www.elsevier.com/locate/envsoft
XQuery as a retrieval mechanism for longitudinal multiscale forest resource data Jussi Rasinma¨ki* Department of Forest Resource Management, University of Helsinki, P.O. Box 27, 00014 University of Helsinki, Finland
a r t i c l e i n f o
a b s t r a c t
Article history: Received 5 August 2008 Received in revised form 3 December 2008 Accepted 4 March 2009 Available online 18 April 2009
In most forested environments, boreal forest in particular, monitoring and evaluation of management activities would be facilitated by storage of multiscale data sets spanning long time periods. For this study, such a data repository based on a generic spatio-temporal conceptual data model was created using XML. The data repository containing changes in data content for a 1000 ha forest estate between 1871 and 1994 was used to evaluate the data retrieval capabilities of XQuery. The data retrieval operations consisted of pivoting, sequencing, and aggregation. Pivoting transforms the generic attribute storage used in the data model into a more compact column-wise format. Sequencing is used to return a temporal sequence of values, and for the traversal of the object hierarchy. Aggregation aggregates the values of a variable over a grouping variable. The combinations of the basic retrieval operations were also investigated as well as the effect of time and object hierarchy on the retrieval operations. The conceptual match between the generic spatio-temporal data model and the XML data model was apparent at the data retrieval operations. The key properties of XQuery for the implementation of the retrieval operations were extendability, support for nested structures, and data transformation functionality. Ó 2009 Elsevier Ltd. All rights reserved.
Keywords: Generic data model Hierarchical Temporal Data access methods Forest inventory
1. Introduction The focus of forest resource management has shifted from management for timber yield to ecosystem management, which can be operationalised with adaptive management consisting of a cycle of planning, actions, monitoring, and evaluation. This cycle facilitates integration of new goals, knowledge and technology into the management (Reynolds, 2005). In most forested environments, boreal forest in particular, there is a considerable time lag between evaluation and planning, as well as implementation and monitoring, and thus the temporal dimension has a particular role in data management. To illustrate this point, the current recommendations for silvicultural practises in Finland, part of the boreal vegetation zone, have guidelines for the rotation age of a single tree generation ranging from 50 to 150 years for different tree species (Anon, 2006). The long time frame usually means changes in data content and reporting needs within an organization. Changes in field measurements, for instance, normally require changes in the data model. Since the changes could otherwise prevent the completion of queries in a database (Kim and Seo, 1991), extensive reconstruction of the database that implements the model is needed. This can be disruptive both for monitoring the effects of forest
* Tel.: þ358 400 382364; fax: þ358 9 19158159. E-mail address: jussi.rasinmaki@helsinki.fi 1364-8152/$ – see front matter Ó 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.envsoft.2009.03.004
management and evaluating the results of the previous planning and implementation cycle; a change in the conceptual data model is a major drawback from the longitudinal analysis viewpoint. The changes in the structure of the database can be minimised by generalising the data model used. A generalised model can assimilate changes in the data, be it objects being analysed, or specific properties being measured, to the degree that the change does not initialise a rewrite of the data model. This is a case study of the implications using such a generalized data model has for the underlying function for any kind of analysis; data retrieval from the data store implementing the data model. The implications are demonstrated in the context of forest resource data. 2. Data management that supports longitudinal multiscale forest resource data analysis 2.1. The data model The general conceptual data model for spatio-temporal environmental data management used in this study is described in detail by Rasinma¨ki (2003). The main concepts of the model are presented below. The model has two main devices to achieve the level of generalisation that would allow it to be used in the storage of long time frame data sets. The data model defines only one basic entity that has a lifetime, within which it has locations and attributes that have their own lifetime. To match the conceptual view of forest at any given time, various classes of the basic entity can be
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
1154
created; e.g., the forest can be modelled as forest stands which are collections of trees stratified by tree species. The result is a threelevel hierarchy of forest stand, tree species stratum and tree entity classes. However, each of these three classes are represented by the same object class in the data model, the only difference between the entity classes being that their object instances have different semantical references, e.g. trees pointing to the reference of a ‘‘tree’’ in the descriptive semantics class (Winston, 1984). The conceptual data model explicitly stores the spatial hierarchy and temporal succession of entities, each entity being aware of its predecessors and successors and the entities that are higher up and lower down in the containment hierarchy (Fig. 1). The second main device for achieving a generalised data model is the entity–value–value interpretation model used to store the attributes of entities. Specifying a relation in which all values are accompanied by the value interpretation allows changes in the stored variables without a change in the underlying data model as each spatio-temporal entity only has three types of attributes defined; its location, categorical and numerical observations of the properties of the entity (Fig. 2). There can be any number of these two types of observations at any given moment, and they are expressed as variable-value pairs. Each value interpretation is expressed using descriptive semantics in the common semantics class. This kind of variable-value association has its roots in the communication messages and semantic nets of artificial intelligence (Winston, 1984), and extensions of the basic relational data model (Codd, 1979).
i.e., the traversal of the object history, and returns the predecessors of an object with their data. Sequencing may also be applied to the traversal of the object hierarchy from the top level objects to their sub-objects. Aggregation is used to aggregate the values of a variable over a grouping variable. These basic data retrieval operations can be extended to include the temporal dimension; e.g., pivoting returns the data for a particular year in a column format, sequencing returns the object history limited to a particular time window, and aggregation is performed for a particular year. As an example of extending the data retrieval operations by combining the basic operations, sequencing and pivoting were used in the same query to retrieve the object history in column based form. 2.3. Implementation alternatives Although the data model generalisation allows for the storage of multiscale, longitudinal data sets and is able to accommodate the changes in the data content over time (Rasinma¨ki, 2003), the retrieval of data from a data source implementing the data model may pose problems. Most of the data management in current information technology infrastructure is accomplished with relational database management systems (RDBMS). However, the relational algebra behind an RDBMS assumes attribute data for an object presented in a relation, in which all the attributes would form a tuple, i.e., a row in a database table, instead of a group of tuples in the attribute–value form. Another intrinsic difficulty in implementing the data model in a relational database is the recursive nature of both the object history and the object hierarchy. The standard relational algebra data retrieval language, SQL, has only recently been extended by support for recursive queries (Eisenberg and Melton, 1999), but this standard is not consistently implemented in the current relational database management systems. To illustrate the inherent problems SQL has for retrieving data from the generic data model, a brief example involving attribute value pivoting (see Fig. 3) is given. Taken apart from the rest of the data model, the attribute storage could be implemented as a relation ATTR(object_id, value, variable, begin_date, end_date, true_obs, current_obs). This one-attribute-per-row model can be pivoted to the more concise column format using nested queries which each return one attribute to the overall result set. The nested sub-queries should be combined using outer joins in order to include objects with possible null values in the set of results. Listing 1 provides an example of the required nested query structure for pivoting two attributes. Each added attribute would deepen the nesting with one
2.2. The requirements The data retrieval from a data store implementing the generic conceptual data model must support the traversal of the object hierarchy, and the object history. It must also accommodate the model of storing the object attributes as the value–value interpretations pairs. These requirements can be met using three operations: pivoting, sequencing, and aggregation (Johnson and Chatziantoniou, 1999). Pivoting transforms the generic object–attribute–value schema to a data content specific schema; i.e., the variable attributes in the row based model are converted into column names in the column based model (Fig. 3). This enables access to the attribute data in a more compact column-wise format, which is also the model supported by most data analysis tools, beginning from spreadsheets. Sequencing is used to return a temporal sequence of values,
predecessor
superfeature subfeature
* *
*
*
successor
semantics
ST_entity
*
stand
stratum
1
name:String semantic_category:String meaning:String
tree
Fig. 1. The main concepts of the spatio-temporal data model: a single entity type (ST_entity) is used for all data content. Different types of entities are defined in a semantics class. The spatio-temporal entities are connected to each other both in space (sub- and superfeature) and time (predecessor, successor).
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
1155
ST_entity valid_period:Period 1
1
* shape
1
*
* numerical_obs
categorical_obs date_obs:Date
geometry:Geometry valid_period:Period
value:Float date_obs:Date
*
1 categorical_value
*
value:Int value_descr:String 1
*
1
semantics name:String semantic_category:String meaning:String Fig. 2. The principle of attribute storage for spatio-temporal entities. The ability to store attribute content that changes over time is gained by storing the attribute values together with the value interpretation (semantics class).
additional level thus making large pivot table queries close to intractable and very difficult to maintain. Moreover to combine both types of operations; pivoting for attribute data access using correlated sub-queries and recursion for sequencing; into a single query would be very cumbersome. Although not feasible in its standard format, SQL could also be enhanced to support the data retrieval operations for the generic data model by defining new relational operators that would simplify the query syntax and efficient algorithms to implement the new operators. Several such operators and algorithms have been suggested, especially in a data-mining and warehousing context (Chatziantoniou, 1999; Chatziantoniou and Ross, 1996; Johnson and Chatziantoniou, 1999; Nestorov and Tsur, 1999; Ross et al., 1998). An alternative solution would be a different physical data storage design (Datta et al., 1998). There are also proposals for solving the problems which occur when the data model changes; e.g., higher order expressions for queries where variables can be either data or metadata, such as database names (Krishnamurthy et al., 1991), but none of these are widely available in standard database management systems. The support for nested structures (Codd, 1979; Roth et al., 1988; Jaeschke and Schek, 1982), however, is starting to make its way into the implementations of the relational model, paving the way to better support of hierarchical data models. For an example of an implementation of the generic data model using an extended relational database, in that case an object-relational database management system, see Rasinma¨ki (2003).
Besides an extended relational database management system an object database and a document database for structured documents could be used to implement a data store that better supports the characteristics of the generic data model. The object database is usually tightly coupled with the programmatic representation of the data model. There are also new database structures emerging that promote modularity and configurability over the fixed relational data model. These would also allow for closely implementing the nested conceptual data model at the logical level (Seltzer, 2005), but would also require programmatic implementation of data retrieval operations (Chaudhuri and Weikum, 2000). Another problem with this class of solutions is that there is no standard to base the implementation on. For structured documents, however, such standards exist. Hierarchical Data Format (HDF) (Folk et al., 1999) conforms to an accepted standard, supports metadata augmented hierarchical data storage, and can handle a variety of data types. It also provides an application interface mechanism, is accessible on the Internet, and has an established user base. However, the main weakness of HDF is its lack of a sophisticated data query interface (Duane et al., 2000). There are non-standard implementations of a common data access interface on top of the data specific access methods provided by the HDF API (Duane et al., 2000; Alted and Ferna´ndez-Alonso, 2003) which could serve as a basis for implementing data query mechanisms for pivoting, sequencing and aggregation. Another option is Extensible Markup Language (XML) which describes data objects called XML documents and partially
Object Attribute Value 1 species
1
1 diameter
10
1 height
12
1 volume
175
2 species
2
2 diameter
23
2 height
22
2 volume
345
Object species diameter height volume 1
1
10
12
175
2
2
23
22
345
Fig. 3. The principle of pivoting operation used to convert the row-oriented data model used in the spatio-temporal data model into a column-oriented one.
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
1156
describes the behaviour of computer programs that process them (Bray et al., 2004). The XML documents consist of elements that may be nested, i.e., they form a tree. Thus hierarchical data are well suited to representation as an XML document. The design goals for XML also included support for a wide variety of applications, human legibility and easy creation. In addition to the properties of the data content definition, which match those of the generic data model, the XML family of standards also includes one for data retrieval. The aim of the study was to determine whether XML and its data retrieval language, XQuery (Boag et al., 2005), could be used to implement a data storage for forest resource data that would be able to provide storage and retrieval for hierarchical forest resource data spanning a period of time using the data model described above. 3. Material and methods The material used in the study consisted of data from nine forest inventories of a 1000 hectare forest estate in Southern Finland (Table 1). The corresponding forest stands from consecutive inventories were manually linked based on the attribute information and spatial overlay. The inventory data were entered in an XML database where each forest stand was stored using the structure matching the spatiotemporal conceptual data model (Rasinma¨ki, 2003). (Fig. 4). The structure of the XML database can be enforced using XML schema definitions; e.g. XML Schema (Fallside and Walmsley, 2004). The temporal information was stored using two mechanisms. Each object property had an associated time period for which the property was valid, and in addition the temporal sequence of objects was expressed as lists of object ids for the ancestors and descendants of an object. Although the data cover the same geographical area at different years, there was a lot of variation as to how many stands it was considered to consist of and especially the number of attributes recorded for each stand (Table 1). This time varying data content was stored as value–value interpretation-pairs avoiding the need for a fixed definition of the data content. The cost of this flexibility is that the data content at any given time for any given object cannot be enforced through schema definitions as the data value interpretation is itself in data role as well. The XML implementation of the data model provided two options for modelling the multiscale properties of data. It was possible to model the spatial hierarchy of objects implicitly utilising the hierarchical structure of XML. The top level object of the type ‘‘stand’’ may have ‘‘stratum’’ sub-objects that are stored inside the subobjects element of the stand object. Again the strata objects may have ‘‘tree’’ subobjects stored in their sub-objects element (Fig. 5a). A more fine-grained approach to the object hierarchy was obtainable by implementing it analogously to the temporal implementation, in which the spatial ordering of objects is expressed as lists of ids of super and sub level objects (Fig. 5b). Another benefit is that the certainty of the relationship can be directly modelled by giving an attribute for the object link which describes the degree of certainty of the association among the objects. Similarly one to many relationships among objects can be modelled by allowing more than one sub- or super-object ids for any object level. The definition of the temporal and scale dependent relationships with object linking and the storage of the object attribute definitions together with the actual data values lead to a very generic conceptual data model for storing data at different scales over time. In this particular case the data contained only a single data object level of stands, but it should be noted that because of the linking mechanism the data hierarchy could by arbitrarily deep.
4. Results 4.1. Basic data retrieval operations using XQuery The pivoting query (Listing 2) demonstrates the basic structure of an XQuery. The task of the query is to return the main class, sub Table 1 Inventory data characteristics. The data cover the same geographical area. Year
Stands
Attributes
1871 1907 1935 1950 1962 1976 1984 1994
495 622 417 413 491 449 420 409
27 33 23 15 36 37 32 71
class, age and basal area attributes for a single stand having the id ‘1994_226.2’. The sub-structure of a hierarchical element is accessed using XPath expressions (Berglund et al., 2005), such as the numerical observations of a stand (Fig. 4) using the expression $stand/numerical_obs_part/n_obs, where the $stand variable refers to an ST_object element in the data. New elements are constructed by enclosing an XPath expression inside {} brackets. The For-LetWhere-Order-Return expression (FLWOR) is used to control the behaviour of the query. The return clause accompanied by either a for or let clause is the mandatory part in the FLWOR expression, whereas where and order clauses are optional. In the pivot example for each stand in the existing stand database (collection(’test.dbxml’)//ST_object), two variable assignments are made (let clause), a condition is evaluated (where clause), and results are returned if the condition is true (return clause). As can be seen from the nesting of further FLWOR expressions inside the return clause, the hierarchical structure of XML extends to XQuery as well. On the top level the $c_attrs variable is defined to include all categorical observations for a stand. In the nested subquery each observation is then processed individually (for $c_attr in $c_attrs) and the attribute name and value are returned given a condition. To present the result of the query in a more human accessible form, it can be returned as an HTML table by embedding the HTML tags needed in the return clauses of the XQuery (Fig. 6). However, this simplified version of the query relies on complete data coverage. To return a valid HTML table in the case of missing values, explicit handling of missing values should be added to the query. The sequencing query (Listing 3) utilises the extendibility feature of XQuery. The task for the query is to return all the predecessors of a stand; i.e., the history of the stand from the time predating the stand. The history contains the predating stands and all of their attributes. User defined functions can form part of a query and these functions can be recursive as is the case in the sequencing query. The recursion is used to traverse the stand history and return all the ancestors of a stand with their data. The past-function returns the history in a nested form, each ancestor being a child element of its successor stand. This nesting is flattened in the main query using the XPath expression $stands//info in the for clause to iterate over all of the info-elements regardless of their nesting level. The main hurdle for aggregation queries in Xquery is that the FLWOR expression in XQuery lacks the specific Group by clause. Grouping is needed in aggregation to express the attributes over which the data attribute is aggregated; e.g., ‘‘find the average hectare volume in each stand development class (i.e. tree size class)’’ (Listing 4). In XQuery the value aggregation for a specified variable over a grouping variable can be achieved using a combination of two for clauses and a where clause (Listing 4). The first for clause establishes the data set to be aggregated. The second for clause together with the let and where clauses iterate over the distinct values of the grouping variable for which data variable values exist. 4.2. Combinations of the retrieval operations XQuery is a declarative language in which all the queries are expressions. Hence a combination of different data retrieval operations translates to combining different expressions together as a new expression. In Listing 5, the sequencing operation is first used for the stand in question to return the set of ancestor stands, complete with all their data content. This collection of data nodes is then further processed by the pivoting operation to return specific stand attributes in column-wise format for every stand belonging to the lineage of the queried stand. The result is a complete history of a subset of the stand attributes for a stand formatted in such a way that it’s easier to transfer for further analysis.
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
1157
Fig. 4. An example illustrating how a row-oriented data set having a data content fixed to the particular inventory is converted into the XML implementation of the conceptual data model: stand observations between different time instants are linked together, and the attributes are stored as value–value interpretation-pairs.
4.3. The effect of time on the retrieval operations For longitudinal data the basic retrieval operations should also be applicable with time constraints. Listings 5–7 demonstrate how the temporal constraints at the user defined resolution can be
a
embedded into the basic pivoting query. First, a temporal constraint function is defined (Listing 6). In this particular example the temporal resolution is set at one year, but it could be defined to be coarser or more fine-grained using the user defined function extensions of XQuery.
b
Fig. 5. Two options, each with their own implications for data retrieval, for storing nested objects at different scales: a) utilising the nesting properties of an XML document, b) explicitly coding the nesting using object identifiers. The stand is the top level object, which contains any number of tree species and size strata. The strata contain trees.
1158
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
Fig. 6. Code fragments of a pivoting query that returns the query result as an HTML table. The XQuery return clause consists of a mixture of HTML-code for output formatting and the actual pivot XQuery, which returns four attribute values for three stands.
The temporal constraint function is then used in a function that limits the query results to the desired attributes from the given year (Listing 7). In the third step, the filtered attribute values are pivoted to a column-wise format (Listing 8). The stand history traversal is a temporal operation by definition, but using the past-function defined in Listing 3, a further temporal constraint can be set to stop the predecessor iteration once a particular year has been passed. The temporal constraint for an aggregation operation can be set identically to the pivoting operation by filtering the data set being aggregated. In Listing 9 a mechanism for using several grouping variables is explored. The definition of the filter-by-year-categorical function is similar to the numerical observation filtering function definition in Listing 7. The query returns the average age grouped by the development class of a stand and the site type; i.e., growing conditions; for a given year.
for $tree in doc(’data.xml’)/ST_object[id ¼ ’1994_1’]/subobjects/ST_object/subobjects/ST_object As seen in Listing 5, parts of the data tree can be passed from expression to expression, which enables embedding of object hierarchy directly into the data retrieval operations and their combinations. For example, a query averaging the diameter of trees by tree species for each stand in the ancestry of a specific stand requires extending the return set of the past-function (Listing 3) with the
element, otherwise it is a combination of sequencing and aggregation operations already discussed. The other option for implementing the data hierarchy is to utilise object identifiers to reference the super and sub level objects of any object in the hierarchy. That approach is analogous to the implementation of temporal lineage, and the traversal of the object hierarchy would in this case be equal to the sequencing operation used in the temporal case.
4.4. The effect of the object hierarchy on the retrieval operations
5. Discussion
When the object hierarchy is modelled using nesting in the XML document structure, all the previous data retrieval operations are applicable to any level of the object hierarchy by changing the XPath expression in the for expression; i.e., navigating directly to the desired level in the tree-like data structure. In the case of stand– stratum–tree hierarchy, the stratum level objects of a specific stand are accessible with
The main benefit of the employed entity–attribute–value data model is its ability to accommodate change: the model is able to assimilate changes in the objects that are measured, in their relationships, and in the data content for each object. In addition to the proposed use for forestry data (Tokola et al., 1997; Rasinma¨ki, 2003) the model has been applied to clinical data where adaptability to changing information needs and changes in data collection are commonplace (Johnson et al., 1994; Johnson, 1996; Nadkarni and Brandt, 1998). In that context, the challenges that this model of storing attribute values poses to data retrieval have been recognised (Nadkarni and Brandt, 1998; Johnson and Chatziantoniou, 1999; Chen et al., 2000). Using only standard SQL, retrieval of
for $stratum in doc(’data.xml’)/ST_object [id ¼ ’1994_1’]/subobjects/ST_object/ and the tree level objects with the expression:
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
a given set of attribute values requires either the use of deeply nested correlated sub-queries or an equal number of query unions. The same result can be obtained in XQuery using a single query that remains essentially unchanged regardless of the number of attributes being queried. The crux of the solution to the pivoting problem lies in nesting FLWOR expressions in the return clause of the main query. For a stand that satisfies the top level condition, the categorical attributes and numerical attributes must be iterated through only once in order to generate all the data columns needed. This is accomplished by a combination of a where clause that lists all the required variables and the return clause that combines the separate variable and value elements into a new single element that takes its name from the variable element content and its content from the value element content. The ability to extend the basic XPath and FLWOR based syntax of XQuery with user defined recursive functions is essential for the implementation of the sequencing operation. The recursive pastfunction defined in the sequencing query essentially equals the whole operation, which makes it easy to integrate the object history traversal with other data retrieval operations like pivoting. The XQuery extension approach also allows adding handling of time in a constraint or filtering role to the basic data retrieval operations. Thus, both concepts of time in the generic data model, time as a relation between objects, and time as an attribute of the object are facilitated by XQuery. The lack of a grouping clause in the FLWOR expression complicates the construction of aggregation queries (Borkar and Carey, 2004). Each grouping variable has to be processed with its own for clause which again generates a new variable in the main let clause of the aggregation query, thus making the query progressively more convoluted and harder to generate, optimise and maintain. Another problem associated with the aggregation queries is the omission of null values of the grouping variables from the result set; i.e., the aggregation query as presented in this study does not return the aggregated data values in those cases where the grouping variables have no values. To illustrate this, if the data doesn’t contain any stands of development class 5, it’s not present in the results for ‘‘average stand volume for each development class’’, and thus the user would not know of the existence of that development class from the results of the query alone. This can be fixed by augmenting the aggregation query with a sub-expression that specifically deals with the null values (Borkar and Carey, 2004). However, as the number of grouping variables increases, the number of sub-expressions required multiplies as all the combinations of possible null grouping variables have to be included in the query. The two choices for object hierarchy encoding provide different characteristics. By embedding the object hierarchy in the structure of the document itself, queries become straightforward to write as the whole data hierarchy for any particular top level object can be accessed and handled as a single data unit. However, the object embedding introduces an element of inflexibility to the queries forcing them to rely on the relative position of objects in the data hierarchy, e.g., ‘‘trees are the second level sub-objects of stands, and the first level sub-objects of strata’’. Should the data model change so that a level is added or removed from the data hierarchy, all the queries involving traversal of the hierarchy must be rewritten to reference the new, changed relative positions of data object levels. Furthermore, only top down navigation of the object hierarchy is facilitated by this approach, in which the subobjects can be passed easily from expression to expression. The other option in implementing the data hierarchy, i.e., utilising object identifiers to reference the super and sub level objects of any object in the hierarchy, provides flexibility with regard to the data object changes, and bidirectional navigation of the object hierarchy. The downside is that specific user defined functions are
1159
needed to navigate from one level to another in the data hierarchy. This is, however, alleviated by the fact that these functions are in essence similar to those needed for the traversal of temporal hierarchy. 6. Conclusion The generic spatio-temporal data model considered here and the XML data model have a degree of matching at the conceptual level in the way they both rely heavily on nested, hierarchical structures. This match is also apparent at the data retrieval operations. Whereas the implementation of the data model using the data definition structures of SQL is quite straightforward, the mismatch between the attribute–value-pairs model and tuple model of the relational algebra poses considerable challenges for data retrieval. These challenges are not obvious when accessing data from an XML data store using XQuery. Being based completely on expressions, XQuery is easily extendable, has strong support for nested structures, and data transformation is a core function. Further research is however needed for a complete data storage solution for spatio-temporal data. Beside data retrieval, data manipulation operations are also needed, and XQuery standard currently lacks any data manipulation function. Verbosity may also pose a problem for big data sets as there is a trade-off between clear semantics of data structures and the size of the encoded data sets. Acknowledgements The data set complete with the time period linking was provided by Risto Viitala from the Ha¨me Polytechnic. The study was financed by Metsa¨miesten sa¨a¨tio¨ and the Finnish Academy of Science. Appendix. Listings Listing 1. Pivoting value-per-row model using nested sub-queries in SQL. Returning the development class and area for each stand SELECT d.id, d.area, j.devClass FROM (SELECT id, value AS devClass FROM attr WHERE variable ¼ ’devClass’) j RIGHT OUTER JOIN (SELECT y.id, k.value AS area FROM (SELECT DISTINCT o.id FROM attr o) y LEFT OUTER JOIN (SELECT id, value FROM attr WHERE variable ¼ ’area’) k ON y.id ¼ k.id) d ON j.id ¼ d.id;
Listing 2. Pivoting the row based data into column format using XQuery The query: for $stand in collection(’test.dbxml’)//ST_object let $c_attrs : ¼ $stand/categorical_obs_part/c_obs, $n_attrs : ¼ $stand/numerical_obs_part/n_obs where $stand/id ¼ ‘1994_226.2’ return { $stand/id } { for $c_attr in $c_attrs
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
1160
let $name : ¼ $c_attr/variable/text() where $name ¼ (’Main class’, ‘Sub class’) return element { $name } { $c_attr/value/text() } } { for $n_attr in $n_attrs let $name : ¼ $n_attr/variable/text() where $name ¼ (’Age’, ’Basal area’) return element { $name } { $n_attr/value/text() } } The result: 1994_226.2 1 1 17 4.1
Listing 3. History traversal of a stand using a sequencing XQuery. The past-function can have 0 to n id elements as input and it returns 0 to n stand elements back The query: declare function local:past($ids as element()*) as element()* { for $id in $ids return { $id/id, $id/valid_period, $id/categorical_ obs_part, $id/numerical_obs_part, $id/comment_obs_ part } { for $pred in $id/predecessors/predecessor return local:past(collection(’test.dbxml’)/ data/ST_object[id ¼ $pred]) } }; let $parent-id : ¼ ‘1994_440’ let $stands : ¼ local:past(collection(’test.dbxml’)/ data/ST_object[id ¼ $parent-id]) for $info in $stands//info order by $info/valid_period/begin_date return { $info/id, $info/valid_period, $info/categorical_obs_part, $info/numerical_obs_part, $info/comment_ obs_part } The result: .
1976_513 1976-01-01 1983-12-31 . 1984_309 . 1994_440 .
Listing 4. An aggregation query using a FLWOR expression. Notes on syntax: distinct-values is an XQuery function returning distinct values in its input data The query: for $data in collection(’test.dbxml’)/data for $c in distinct-values( $data/ST_object/categorical_obs_part/ c_obs[variable ¼ ’dev_class’]/value) let $c-group : ¼ $data/ST_object[categorical_ obs_ art/c_obs/variable ¼ ‘dev_class’ and categorical_obs_part/c_obs/value ¼ $c] where exists($c-group) order by $c return {avg($c-group/numerical_obs_part/n_obs[variable eq ’volume’]/value) } The result: 119.5 197.8 .
Listing 5. A query combining the sequencing and pivoting operations to return an object history in column format. See the definition of past-function in Listing 3. Notes on syntax: in ST_object[id ¼ $parent-id] an XPath expression is used to limit returned ST_object elements to the one having the same idelement value as the variable parent-id The query: declare function local:past($ids as element()*) as element()* . let $parent-id : ¼ ‘1994_440’ let $variable-list : ¼ (’age’, ‘basal area’, ’area’) let $stands : ¼ local:past(collection(’test.dbxml’)/ data/ST_object[id ¼ $parent-id]) for $info in $stands//info order by $info/valid_period/begin_date return
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
{$info/id, $info/valid_period} { for $n_attr in collection(’test.dbxml’)/data/ ST_object[id ¼ $info/id]/ numerical_obs_part/n_obs let $name : ¼ $n_attr/variable/text() where $name ¼ $variable-list return element { $name } { $n_attr/value/text() } } The result: 1871_402 . 4.66 . 1907_253 . 100 3.87 . 1976_513 . 3.3 55 . 1994_440 . 80
Listing 6. An XQuery function that checks whether a particular year is within a period, including an open ended period declare function local:year-within($check-year as xs:integer, $start as xs:date*, $end as xs:date*) as xs:boolean* { if (empty($end)) then ( if ($check-year > ¼ year-from-date($start)) then true() else false() ) else ( if (($check-year > ¼ year-from-date($start)) and ($check-year <¼ year-fromdate($end))) then true() else false() ) };
1161
Listing 7. An XQuery function that filters the numerical observations of a stand so that only observations from a particular year are returned declare function local:filter-by-year-numerical($check-year as xs:integer, $stand as element()*, $num-var) as element()* { for $var in $num-var for $s in $stand for $n_obs in $s/numerical_obs_part/ n_obs[variable ¼ $var] let $begin : ¼ $n_obs/valid_period/begin_date, $end : ¼ $n_obs/valid_period/end_date where local:year-within($check-year, $begin, $end) return { $s/id } { $n_obs } };
Listing 8. An XQuery that pivots the categorical and numerical observations from a particular year The query: let $target-year : ¼ 1984, $c-attrs : ¼ (’dev_class’,’site_type’), $n-attrs : ¼ (’area’) for $data in doc(’data.xml’)/data let $r: ¼ local:filter-by-year-categorical($targetyear, $data/ST_object, $c-attrs) let $s: ¼ local:filter-by-year-numerical($targetyear, $data/ST_object, $n-attrs) return { let $id_list : ¼ ($r j $s)/id for $id in distinct-values($id_list) return { $id } { for $stand in $r where $stand/id ¼ $id return element {$stand/c_obs/variable/text()} {$stand/c_obs/value/text()} } { for $stand in $s where $stand/id ¼ $id return element {$stand/n_obs/variable/text()} {$stand/n_obs/value/text()} } } The result: 1984_213.1 05 131 2.0
1162
¨ ki / Environmental Modelling & Software 24 (2009) 1153–1162 J. Rasinma
1984_213.9 05 131 7.9
Listing 9. An aggregation query using a temporal constraint and two grouping variables over a data variable. Note on syntax: ($id_1 except $id_2) is used to return the intersection of two node sets; i.e., only stand ids that are both in id_1 and id_2 node collection are returned The query: let $target-year : ¼ 1988, $cat-grouping-var : ¼ (’dev_class’, ‘site_ type’), $data-var : ¼ ’age’ for $data in doc(’data.xml’)/data let $s: ¼ local:filter-by-year-categorical($targetyear, $data/ST_object, $catgrouping-var) let $n: ¼ local:filter-by-year-numerical($targetyear, $data/ST_object, $data-var) return { for $vals-1 in distinct-values($s/c_obs[variable ¼ $cat-grouping-var[1]]/value) for $vals-2 in distinct-values($s/c_obs[variable ¼ $cat-grouping-var[2]]/value) let $id_1 : ¼ $s[c_obs/variable ¼ $cat-groupingvar[1] and c_obs/value ¼ $vals-1]/id let $id_2 : ¼ $s[c_obs/variable ¼ $cat-groupingvar[2] and c_obs/value ¼ $vals-2]/id let $id-group : ¼ ($id_1 except $id_2) where exists($id-group) return { element {$cat-grouping-var[1]} {$vals-1} } { element {$cat-grouping-var[2]} {$vals-2} } { avg($n[id ¼ $id-group]/ n_obs[variable ¼ $data-var]/value) } } The result: 06 13115 05131 95 .
References Alted, F., Ferna´ndez-Alonso, M., 2003. PyTables: processing and analysing extremely large amounts of data in python. Available from: http://www.pytables.org/docs/ pycon2003-paper.pdf (accessed 19th April, 2006). Anon, 2006. Hyva¨n metsa¨nhoidon suositukset. Metsa¨kustannus, Helsinki (in Finnish). XML XPath Language (XPath) 2.0. In: Berglund, A., Boag, S., Chamberlin, D., Ferna´ndez, M.F., Kay, M., Robie, J., Sime´on, J. (Eds.) Available from: http://www. w3.org/TR/xpath20/ (accessed 19th April, 2006). XQuery 1.0: an XML query language. In: Boag, S., Chamberlin, D., Ferna´ndez, M.F., Florescu, D., Robie, J., Sime´on, J. (Eds.) Available from: http://www.w3.org/TR/ xquery/ (accessed 18th April, 2006). Borkar, V., Carey, M., 2004. Extending XQuery for grouping, duplicate elimination, and outer joins. Proceedings of the XML2004 Conference. Available from: http://www.idealliance.org/proceedings/xml04/papers/229/ XQueryExtensionsFinal.html (accessed 19th April, 2006). Bray, T., Paoli, J., Sperberg-McQueen, C.M., Maler, E., Yergeau, F. (Eds.), 2004. Extensible Markup Language (XML) 1.0, third ed. Available from: http://www. w3.org/TR REC-xml/ (accessed 18th April, 2006). Chatziantoniou, D., 1999. Evaluation of Ad Hoc OLAP: In-Place Computation. In: 11th International Conference on Scientific and Statistical Database Management July 28–30, 1999, Cleveland, Ohio, pp. 34–43. Chatziantoniou, D., Ross, K.A., 1996. Querying multiple features of groups in relational databases. In: Vijayaraman, T.M., Buchmann, A.P., Mohan, C., Nandlal, L.S. (Eds.), VLDB’96, Proceedings of 22nd International Conference on Very Large Data Bases. September 3–6, 1996, Mumbai (Bombay), India, pp. 295–306. Chen, R.S., Nadkarni, P., Marenco, L., Levin, F., Erdos, J., Miller, P.L., 2000. Exploring performance issues for a clinical database organized using an entity–attribute– value representation. Journal of the American Medical Informatics Association 7 (5), 475–487. Chaudhuri, S., Weikum, G., 2000. Rethinking Database System Architecture: Towards a Self-tuning RISC-style Database System. In: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, pp. 1–10. Codd, E.F., 1979. Extending the relational model to capture more meaning. ACM Transactions on Database Systems 4 (4), 397–434. Datta, A., Moon, B., Ramamritham, K., Thomas, H., Viguier, I., 1998. ‘‘Have Your Data and Index It Too’’: Efficient Storage and Indexing for Data Warehouses. Technical Report 98–7. Department of Computer Science, University of Arizona. Duane, W., Kidd, D., Livingstone, D., 2000. Integrating environmental models with GIS; an object-oriented approach utilising a Hierarchical Data Format (HDF) data repository. Transactions in GIS 4 (3), 263–280. Eisenberg, A., Melton, J., 1999. SQL:1999, formerly known as SQL3. SIGMOD Record 28 (1), 131–138. Fallside, D.C., Walmsley, P. (Eds.), 2004. XML Schema Part 0: Primer Available from: http://www.w3.org/TR/xmlschema-0/ (accessed 4th September, 2007). Folk, M., McGrath, R.E., Yeager, N., 1999. HDF: An Update and Future Directions. Geoscience and Remote Sensing Symposium, 1999. IGARSS ’99 Proceedings, vol. 1, Hamburg, Germany, pp. 273–275. Jaeschke, G., Schek, H.J., 1982. Remarks on the algebra of non first normal form relations. In: Proceedings of the 1st ACM SIGACT–SIGMOD Symposium on Principles of database systems, Los Angeles, pp. 124–138. Johnson, S.B., 1996. Generic data modeling for clinical repositories. Journal of the American Medical Informatics Association 3 (5), 328–339. Johnson, S.B., Chatziantoniou, D., 1999. Extended SQL for manipulating clinical warehouse data. Proceedings of the American Medical Informatics Association Symposium 1999, 819–823. Johnson, S.B., Hripcsak, G., Chen, J., Clayton, P.D., 1994. Accessing the Columbia clinical repository. In: Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care, November 5–9, 1994, Washington (DC). McGraw Hill, New York. Kim, W., Seo, J., 1991. Classifying schematic and data heterogeneity in multidatabase systems. IEEE Computer 24 (12), 12–18. Krishnamurthy, R., Litwin, W., Kent, W., 1991. Language features for interoperability of databases with schematic discrepancies. In: Clifford, J., King, R. (Eds.), Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data. Denver, Colorado, May 29–31, 1991, pp. 40–49. Nadkarni, P.M., Brandt, C.,1998. Data extraction and ad hoc query of an entity–attribute– value database. Journal of American Medical Informatics Association 5 (6), 511–527. Nestorov, S., Tsur, S., 1999. Integrating data mining with relational DBMS: A tightlycoupled approach. In: Next Generation Information Technologies and Systems, 4th International Workshop, NGITS’99 Zikhron-Yaakov, Israel, July 5–7, 1999. Rasinma¨ki, J., 2003. Modelling spatio-temporal environmental data. Environmental Modelling & Software 18, 877–886. Reynolds, K.M., 2005. Integrated decision support for sustainable forest management in the United States: fact or fiction? Computers and Electronics in Agriculture 49 (1), 6–23. Ross, K.A., Srivastava, D., Chatziantoniou, D., 1998. Complex aggregation at multiple granularities. Lecture Notes in Computer Science 1377, 263–277. Roth, M.A., Korth, H.F., Silberschatz, A., 1988. Extended algebra and calculus for nested relational databases. ACM Transactions on Database Systems 13 (4), 389–417. Seltzer, M., 2005. Beyond relational databases. Databases 3 (3), 50–58. Tokola, T., Turkia, A., Sarkeala, J., Soimasuo, J., 1997. An entity-relationship model for forest inventory. Canadian Journal of Forest Research 27, 1586–1594. Winston, P.H., 1984. Artificial Intelligence. Addison Wesley.