Pitfalls and possible solutions for using geo-referenced site data to inform vegetation models



ECOINF-00576; No of Pages 5 Ecological Informatics xxx (2015) xxx–xxx

Contents lists available at ScienceDirect

Ecological Informatics journal homepage: www.elsevier.com/locate/ecolinf

Megan J. McNellie a,b,⁎, Ian Oliver c,d, Philip Gibbons b

a Office of Environment and Heritage New South Wales, PO Box 5336, Wagga Wagga, NSW 2650, Australia
b Fenner School of Environment and Society, The Australian National University, Frank Fenner Building, Linnaeus Way, Acton, ACT 2601, Australia
c Office of Environment and Heritage New South Wales, University of New England, PO Box U221, Armidale, NSW 2351, Australia
d School of Environmental and Rural Science, University of New England, Armidale, NSW 2351, Australia

Article info

Article history:
Received 28 December 2014
Received in revised form 22 May 2015
Accepted 26 May 2015
Available online xxxx

Keywords: Point data; Error; Datum shift; Duplicate; Near neighbour; Data integrity

Abstract

Most predictive models rely on ‘the known’ to infer ‘the unknown’. Geo-referenced, on-ground observational data are the ‘point of truth’ upon which many vegetation models are built. We focus on some of the enigmatic errors that we have uncovered when using vegetation plot data. Using a case study, we sourced 9362 sites to examine the prevalence of spatial errors. We found that an incorrect datum was recorded for 5% of sites; less than 2% of sites were duplicated and up to 34% of sites were located within 1000 m of each other. Whilst sites within a 1000 m neighbourhood are not necessarily errors, they do need to be considered within the context of using spatial environmental layers and predictive modelling. We offer solutions for identifying and managing spatial locations of point data to ensure that the information-rich resource held in data repositories is not compromised by unidentified spatial error.

Crown Copyright © 2015 Published by Elsevier B.V. All rights reserved.

1. Introduction

Most predictive models rely on ‘the known’ to infer ‘the unknown’. Geo-referenced, on-ground site data are the ‘point of truth’ upon which many vegetation models are built. On-ground observations, variously referred to as sites, plots, presence records, occurrence records, observations, relevés, response data, informants, exemplars or training data, are stored in multitudes of databases held by museums and herbaria, flora atlases, government and non-government agencies, or in private collections. Collectively, these data represent a massive resource, given the effort required to acquire and store them. Re-using these data capitalises on this significant investment. However, enigmatic errors in point data can affect the accuracy of models derived from them.

Geo-referenced observational data, in their simplest form, describe the location and presence of an entity. These data can be used to model the distribution of a range of biological entities, such as species (e.g. Ferrier et al., 2002), habitats (e.g. Guisan and Zimmermann, 2000), or communities (e.g. Ferrier and Guisan, 2006). When coupled with spatially explicit environmental predictors, site data are a valuable information-rich resource that can be manipulated to serve a number of global biodiversity conservation needs or natural resource management issues. In addition, site data serve as an important resource for validating data acquired using sensor-based technology, such as that returned by Ikonos, Quickbird, LiDAR or Landsat ETM (Kerr and Ostrovsky, 2003).

To capitalise on the investment made in point data, the geo-referenced location needs to describe accurately the position in the landscape (Guralnick et al., 2006). Several authors have outlined some of the effects of inaccurate positional information (Chapman, 2005; Graham et al., 2008; Moudrý and Šímová, 2012; Osborne and Leitão, 2009; Vaughan and Ormerod, 2003). We delve further into errors in point locations, focusing on datum shifts and the consequent errors relating to duplicate data and near neighbours. Using a case study, we highlight some of the errors that we uncovered when using archived site data. We offer methods to evaluate these data and guide practitioners towards some simple solutions to optimise this valuable information-rich resource.

We do not cover other well documented sources of error (Chapman, 2005; Cook et al., 2010; Rondinini et al., 2006; Vaughan and Ormerod, 2003) such as taxonomic errors, observer error or bias, temporal limitations or determining species absence. Nor do we cover issues relating to spatial autocorrelation. Most ecological data are clumped or clustered in environmental space. Assessing and accounting for spatial autocorrelation needs to be treated as a separate process when using point data to inform models (Dormann et al., 2007).

⁎ Corresponding author: M.J. McNellie.

http://dx.doi.org/10.1016/j.ecoinf.2015.05.012 1574-9541/Crown Copyright © 2015 Published by Elsevier B.V. All rights reserved.

Please cite this article as: McNellie, M.J., et al., Pitfalls and possible solutions for using geo-referenced site data to inform vegetation models, Ecological Informatics (2015), http://dx.doi.org/10.1016/j.ecoinf.2015.05.012


M.J. McNellie et al. / Ecological Informatics xxx (2015) xxx–xxx

2. The specialty of spatial errors: a case study

We sourced our geo-referenced site data from the BioNet Database (http://www.bionet.nsw.gov.au). This database is a repository of terrestrial vascular plant information collected within New South Wales, Australia (Fig. 1). Using 9362 sites as our case study, we describe some of the pitfalls we encountered when assessing error in archived sites.

Fig. 1. Study area covering 11.5 million hectares (approximately 14% of New South Wales, Australia). This area is equivalent to the landmass of England. Darker shaded areas show topographic relief. Within this area, we sourced 9362 sites to examine the prevalence of potential errors.

To evaluate the potential for spatial error and to measure the distance between sites using spatial analyses, point data need to be projected into a common reference system (see Appendix 1 for definitions of reference systems). Spatial analyses cannot be performed accurately when points are in a geographic coordinate system (such as latitude and longitude) (Snyder, 1997). All our points were projected to the Lambert Conic Conformal projection.

3. Shifting datum

An incorrect or misrepresented datum (see Appendix 1 for a definition of datum) will cause sites to be positioned in the wrong location. Shifting datum is a particular pitfall related to using archived site data. As an example, in Australia many on-ground field sites were located using printed topographic maps to pin-point their on-ground location. Most hard copy topographic maps used a local datum, the Australian Geodetic Datum 1966 (AGD 66). As handheld global positioning system (GPS) units became affordable, field practices changed, and location could be determined using GPS units, which are usually factory set to use a standard global datum, the World Geodetic System 1984 (WGS 84). This change in practice opened an avenue for the incorrect datum to be recorded in the field or stored incorrectly once data were deposited into a digital database. The difference between AGD 66 and WGS 84 translates to about 133 m north-south and 48 m east-west (NIMA 1997), resulting in sites being located about 210 m south-west of the correct location. Of the 9362 sites in our case study, 502 (5.4%) were found to have an incorrect datum. Fig. 2 demonstrates an example of where sites would fall if they were recorded in AGD 66 compared to WGS 84.

Fig. 2. An example of how the local datum (AGD 66) is offset from the global datum (WGS 84). Detecting this type of error is only possible by visual inspection of the points plotted in the landscape and potential anomalies checked against the survey metadata.

3.1. Solutions for identifying incorrect datum

Datum shift is a cryptic problem that can only be assessed on-screen at a scale of 1:10 000 or finer, using fine-scaled imagery as a backdrop. Using ground control features such as land tenure boundaries (e.g. road or conservation reserves) and landscape features (e.g. watercourses), practitioners need to check for a systematic shift in sites. We emphasise that sites need to be inspected for a systematic shift because typically the same methods (that is, site location is recorded using a printed topographic map or navigational GPS) are used for all sites within a survey. Suspected systematic shifts (datum errors) need to be confirmed by referring to site maps within original survey reports or metadata. Whilst most countries have moved to a global reference system (Iliffe and Lott, 2008), archival site data, collected over a long time span, may have inadvertent error recorded in the details of the site location. Table 1 is provided to help practitioners estimate the offset distance for sites plotted with an incorrect datum.

Checking for datum errors is time-consuming and not all surveys lend themselves to this type of checking. For example, surveys that target multiple tenures do not neatly align to linear boundaries and it is nearly impossible to check if the datum was recorded correctly. Whilst on-screen checking of point locations may seem like an arduous undertaking, especially where point data are dispersed over vast areas, we found that these errors were constrained to a window of time. For example, in our case study, we found that datum errors were more prevalent from 2004 to 2008, a period when field surveyors typically used both printed topographic maps in the local datum and hand-held GPS units set to the global datum to locate field sites.

Datum errors are obscure because it is entirely possible that each of the locations marked by black circles (Fig. 2) was a candidate site and was surveyed. However, the accompanying metadata stated that the survey was conducted within the boundaries of the nature reserve for identifying native vegetation communities. Given that the distribution of points in Fig. 2 shows five sites located outside the boundary of the nature reserve and two located in cleared paddocks, we concluded there was an error in recording WGS 84 as the datum and that site locations were recorded in AGD 66. We confirmed a datum mismatch between the information stored in the database and the true datum of the location by referring to the site map in the original report.

Another potential solution for managing datum errors, where they may be suspected but cannot be confirmed, is to ensure the scale (or the grain size) of the environmental predictor variables is greater than the longest offset distance.

Table 1
North-south and east-west shifts between local geodetic datum and the global standard WGS 84 for some areas. Information on the east-west and north-south shifts (m) extracted from NIMA (1997). Distance measures are the mean solution for the entire landmass. Individual countries or states within the landmass have slightly different offsets.

Area            Local datum  Datum name                             Realisation  Shift (m) E-W  Shift (m) N-S
Australia       AGD 66       Australian Geodetic Datum              1966         −133           −48
Great Britain   OSGB 36      Ordnance Survey of Great Britain 1936  1936         375            −111
                OSGB 80      Ordnance Survey of Great Britain 1980  1980         −86            −96
South Africa    CAPE                                                1987         −136           −108
Eurasian Plate  ED 50        European Datum 1950                    1950         −87            −98
North America   NAD 27       North American Datum                   1927         −8             160
                NAD 83       North American Datum                   1983         0              0
South America   SAD 56       Provisional South American 1956        1956         −288           175
                SAD 69       South American 1969                    1969         −57            1
Japan           Tokyo Datum  Tokyo Datum                            1991         −148           507

4. Spatial interrogation to identify near neighbours

Spatial analyses used to measure the distance between two sites can illuminate a variety of issues that need consideration. Correct measurement of the distance between sites hinges on all point data having a projected reference system and a correct and common datum. We found that by measuring the nearness of neighbours we were able to uncover issues that need to be considered when using point data to inform models. The first issue is duplicate sites, that is, two (or more) sites that are 0 m apart; the second is repeated measures over time; and the third is ensuring that there is never more than one exemplar per spatial unit (grid cell or polygon) used to inform the model. We briefly discuss these issues below and then follow with potential solutions.
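The third issue, more than one exemplar per spatial unit, can be screened for with very little code. The sketch below is illustrative only and is not the procedure used in the case study: it assumes sites are already in a projected reference system with coordinates in metres, that the predictor grid has square cells, and that the grid origin sits at (0, 0). The function names and the three example sites are hypothetical.

```python
from collections import defaultdict


def sites_per_cell(sites, cell_size_m):
    """Group site identifiers by the grid cell containing their projected coordinates.

    Assumes square cells and a grid origin at (0, 0); a real predictor grid
    would need its own origin subtracted from x and y first.
    """
    cells = defaultdict(list)
    for site_id, (x, y) in sites.items():
        # Floor division maps a coordinate in metres to a cell index.
        key = (int(x // cell_size_m), int(y // cell_size_m))
        cells[key].append(site_id)
    return cells


def crowded_cells(sites, cell_size_m):
    """Return only the cells that hold more than one site."""
    grouped = sites_per_cell(sites, cell_size_m)
    return {cell: ids for cell, ids in grouped.items() if len(ids) > 1}


# Hypothetical sites: identifier -> (easting, northing) in metres.
sites = {
    "A": (1010.0, 2010.0),
    "B": (1030.0, 2040.0),
    "C": (1080.0, 2010.0),
}

print(crowded_cells(sites, 25.0))  # no cell holds two sites at a 25 m grain
print(crowded_cells(sites, 50.0))  # sites A and B share a cell at a 50 m grain
```

The same three sites pass at a 25 m grain but not at a 50 m grain, which is the sense in which nearness depends on the grain size of the environmental predictors.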

4.1. Duplicate data

There are at least two sources of error that can result in duplicated site locations. First, duplicates can arise when identical sites stored in separate databases are merged. Second, transcription errors when recording or entering the site location into the digital repository can result in duplicated site locations. Both are errors which result in two sites appearing at the same point (see Sites 3 and 8 in Fig. 3). By identifying sites with identical co-ordinates, we found 179 (1.9%) pairs of duplicated sites. In our case study, duplication was especially enigmatic because each member of the pair had a unique site name. Spatial interrogation uncovered these errors, which may have been overlooked had we relied on scrutinising only the site name.

4.2. Re-visiting the same site

The second issue uncovered by measuring the distance between sites is whether sites were sampled more than once, either intentionally or serendipitously. Where surveyors have visited the same location (such as remnant patches of native vegetation) on different dates, these sites need to be identified. These data may be useful in assessing change over time; however, if that is not the explicit intention, then different information from the same site may cause model degradation. There are some instances, such as recording species presence, where multiple records in one spatial unit (grid cell or polygon) provide additional robustness to the model. Given that most GPS units are not accurate to within 10 m (Wing et al., 2005), we defined sites that were within 10 m of each other as re-visited sites. When we measured the distance between sites, we found 32 sites (<1%) that were less than 10 m apart.

4.3. Distance between sites needs to be commensurate with the spatial unit

The nominated distance between sites that constitutes them as being near neighbours will depend on the study area (Aiello-Lammens et al., 2015), the spatial scale of the environmental predictors and that of the end-product model (Elith and Leathwick, 2009). The resolution of the grid cells (grain size) used to represent the environmental predictor can influence the ‘nearness’ of site data. Fig. 3 demonstrates that nearness can be influenced by the grain size of the environmental predictors. In our case study, we found that 301 sites (3.2%) were within 100 m, 2050 sites (21.9%) were within 500 m and 3199 sites (34.1%) were within 1000 m of each other. For example, Sites 4, 5 and 7 (Fig. 3) demonstrate an appropriate spacing of points; the distance between sites is commensurate with the grid cell size, in this case 25 m. By contrast, the distance between Sites 1, 2 and 6 (Fig. 3) is possibly too small when using a 50 m grid cell.

5. Solutions for near neighbours

Spatial analyses used to identify duplicated or revisited neighbours are similar. For example, Sites 3 and 8 in Fig. 3 could be a result of either merging datasets or re-visiting the same location. The processes that caused sites to be neighbours are important and warrant closer examination.

5.1. Identifying duplicate sites

Where duplicate datasets were inadvertently merged, sites were first identified by comparing textual information (survey details or site identifier). Simple spatial analyses (such as distance to nearest point) within a geographic information system (GIS) to locate sites with zero or sub-metre distances between them provide a simple solution to identify duplicate pairs. Where duplicated sites occurred due to transcription errors in recording location, and response data differed, both sites were deleted, as it is most likely impossible to determine which is correct unless additional time is invested in cross referencing with original field data sheets or notes (if available). Where sites were duplicated because they were imported more than once, and response data did not differ, a pragmatic decision was required about which to keep and which to remove, retaining only one of the duplicated sites.

5.2. Identifying re-visits

Depending on the purpose of the model, re-visits can be useful. However, most ‘static’ models do not incorporate the temporal component of change, so including both visits would confound model predictions. Scrutinising the data recorded from a single site at two different times and selecting the single most appropriate survey date is possibly the most parsimonious solution. Within our case study, few sites were identified as re-visits, so this type of error may not be prevalent in most datasets.

5.3. Identifying near neighbours

Distance between sites can be measured within GIS using distance queries or by simple arithmetic operations on the X and Y coordinates (of a projected reference system).
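As a minimal sketch of the arithmetic route, the distance between two sites in a projected reference system is the Euclidean distance computed from their X and Y differences. The example below labels each pair of sites using thresholds loosely based on those discussed above (0 m for duplicates, 10 m for re-visits, 1000 m for near neighbours). The function name, parameter names and example coordinates are hypothetical, and the brute-force comparison of every pair is an assumption made for brevity, not a description of the GIS workflow used in the case study.

```python
import math
from itertools import combinations


def classify_pairs(sites, duplicate_m=0.0, revisit_m=10.0, near_m=1000.0):
    """Label every pair of sites closer than near_m.

    sites maps a site identifier to (x, y) in metres in a projected
    reference system; distances in degrees would not be meaningful here.
    """
    flagged = []
    for (id_a, (xa, ya)), (id_b, (xb, yb)) in combinations(sites.items(), 2):
        distance = math.hypot(xa - xb, ya - yb)  # Euclidean distance in metres
        if distance <= duplicate_m:
            label = "duplicate"
        elif distance < revisit_m:
            label = "re-visit"
        elif distance < near_m:
            label = "near neighbour"
        else:
            continue  # far enough apart to ignore
        flagged.append((id_a, id_b, distance, label))
    return flagged


# Hypothetical sites: identifier -> (easting, northing) in metres.
sites = {
    "S1": (500000.0, 6100000.0),
    "S2": (500000.0, 6100000.0),  # same coordinates as S1, different name
    "S3": (500003.0, 6100004.0),  # 5 m from S1
    "S4": (500300.0, 6100400.0),  # 500 m from S1
    "S5": (510000.0, 6100000.0),  # 10 km from S1
}

for id_a, id_b, distance, label in classify_pairs(sites):
    print(f"{id_a}-{id_b}: {distance:.1f} m ({label})")
```

Comparing every pair scales with the square of the number of sites, so for much larger collections a spatial index or a GIS nearest-neighbour tool would be the more practical choice.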
Measuring distance within the GIS has the benefit of being able to spatially locate the near neighbours and select sites to be filtered from the suite of sites. The minimum separation distance between sites should be informed by the attribute being modelled and the grain size of the environmental predictors. Where response data are amenable to averaging (such as foliage cover estimates or abundance), data may not need to be removed; however, they must be identified and suitably managed.

6. Discussion

Fig. 3. Schematic diagram to illustrate some of the different aspects of near neighbours, including duplicated sites within the context of spatial scale of the environmental predictors.

Point data are the central pivot of most model building. Globally, there is significant progress towards storing point data in central and communal repositories (Guralnick et al., 2006). However, data derived from multiple sources have varying or unknown integrity (Franklin, 2009). There are numerous opportunities to import or generate errors when using site data. Given these data represent a massive investment in time and money, and are often used to inform solutions to important ecological and conservation problems, we have identified some enigmatic sources of error and made practical suggestions to remedy these errors.


Our case study has revealed sites that were incorrectly positioned, were duplicated with other sites, and were sampled more than once. Once identified, data were checked and cross referenced against original data sources. Outright transcription errors were removed or corrected; multiple observations from the same location were treated and near neighbours were measured so they were commensurate with the spatial grain of the predictor surfaces. These were fundamental steps required for preparing robust response data. However, not all models require erroneous data points to be removed from the set of informants. Some studies have found that different modelling frameworks responded differently to errors in geo-referenced point data. Graham et al. (2008) concluded that coarser-scaled modelling approaches were fairly robust to some locational errors.

Point data have great potential. Given that these data are considered the ‘point of truth’, and are extensively used to inform spatial ecology and conservation, they must be correctly geo-referenced. Where these data are not correctly located, the value of these ecoinformatics is compromised. The solutions to some of the pitfalls identified in this paper will ultimately save time, effort and money and build better biodiversity informatics.

Acknowledgements

Sarah Hill assisted with data preparation and Philip Gleeson ensured that all errors identified were corrected in the BioNet database. We appreciate the comments of two anonymous reviewers. PG is partly funded by the Environmental Decisions Hub of the Australian Government's National Environmental Research Program.

Appendix 1

Datum: a datum is the reference information needed to fix a coordinate system against the modelled shape of the earth (Iliffe and Lott, 2008). There are hundreds of local datum systems that have been developed independently for specific areas. Many topographic maps have been printed and use the local datum. Most handheld GPS units are factory set to WGS 84. There is no incorrect choice of datum. Either the local datum or a global datum can be used to reference a site. If the datum is unknown or incorrect, this will result in error. For fine-scaled predictive modelling, incorrect datum information will result in an unacceptably large error.

Geographic coordinate system: a global or spherical coordinate system, such as latitude–longitude, that is measured in units of decimal degrees.

Projected coordinate system: a coordinate system such as Lambert Conic Conformal, Albers Equal Area Conic or Universal Transverse Mercator, all of which (along with numerous other map projection models) enable linear units of measurement. Projected coordinate systems provide various mechanisms to project maps of the earth's spherical surface onto a two-dimensional Cartesian coordinate plane. Projected coordinate systems are referred to as map projections, and projected coordinates (such as easting and northing values) are measured in linear units such as metres, feet, miles or kilometres.
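Because projected coordinates are in linear units, a suspected datum offset expressed in metres can be added directly to easting and northing values. The sketch below illustrates one way to screen a survey for the systematic shift described in Section 3.1: count how many sites fall inside a known boundary before and after a candidate offset is applied. It is an assumption-laden illustration rather than the on-screen inspection used in the case study. The reserve is simplified to a rectangular bounding box, and the coordinates and the 150 m offsets are invented for the example and are not taken from Table 1.

```python
def inside_box(x, y, box):
    """Return True if (x, y) lies within box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax


def count_inside(points, box, dx=0.0, dy=0.0):
    """Count points inside the box after shifting each by (dx, dy) metres."""
    return sum(inside_box(x + dx, y + dy, box) for x, y in points)


# Hypothetical reserve boundary, simplified to a bounding box in metres.
reserve = (1000.0, 1000.0, 2000.0, 2000.0)

# Hypothetical recorded site locations (easting, northing) in metres.
recorded = [
    (950.0, 900.0),
    (1200.0, 880.0),
    (1500.0, 1400.0),
    (1800.0, 950.0),
    (900.0, 1500.0),
]

before = count_inside(recorded, reserve)
after = count_inside(recorded, reserve, dx=150.0, dy=150.0)
print(f"inside reserve as recorded: {before} of {len(recorded)}")
print(f"inside reserve after shift: {after} of {len(recorded)}")
```

A jump in the number of sites inside the boundary after one consistent shift is only a prompt for closer inspection; as noted in Section 3.1, a suspected datum error still needs to be confirmed against the original survey report or metadata.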

References

Aiello-Lammens, M.E., Boria, R.A., Radosavljevic, A., Vilela, B., Anderson, R.P., 2015. spThin: an R package for spatial thinning of species occurrence records for use in ecological niche models. Ecography 38, 000–005.
Chapman, A.D., 2005. Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data. Global Biodiversity Information Facility, Copenhagen.
Cook, C.N., Wardell-Johnson, G., Keatley, M., Gowans, S.A., Gibson, M.S., Westbrooke, M.E., Marshall, D.J., 2010. Is what you see what you get? Visual vs. measured assessments of vegetation condition. J. Appl. Ecol. 650–661.
Dormann, C.F., McPherson, J.M., Araújo, M.B., Bivand, R., Bolliger, J., Carl, G., Davies, R.G., Hirzel, A., Jetz, W., Daniel Kissling, W., Kühn, I., Ohlemüller, R., Peres-Neto, P.R., Reineking, B., Schröder, B., Schurr, F.M., Wilson, R., 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30, 609–628.
Elith, J., Leathwick, J.R., 2009. Species distribution models: ecological explanation and prediction across space and time. Annu. Rev. Ecol. Evol. Syst. 40, 677–697.
Ferrier, S., Guisan, A., 2006. Spatial modelling of biodiversity at the community level. J. Appl. Ecol. 43, 393–404.
Ferrier, S., Watson, G., Pearce, J., Drielsma, M., 2002. Extended statistical approaches to modelling spatial pattern in biodiversity in northeast New South Wales. I. Species-level modelling. Biodivers. Conserv. 11, 2275–2307.
Franklin, J., 2009. Mapping Species Distributions: Spatial Inference and Prediction. Cambridge University Press.
Graham, C.H., Elith, J., Hijmans, R.J., Guisan, A., Peterson, A.T., Loiselle, B.A., The Participants of the NCEAS Predicting Species Distribution Working Group, 2008. The influence of spatial errors in species occurrence data used in distribution models. J. Appl. Ecol. 45, 239–247.
Guisan, A., Zimmermann, N.E., 2000. Predictive habitat distribution models in ecology. Ecol. Model. 135, 147–186.
Guralnick, R.P., Wieczorek, J., Beaman, R., Hijmans, R.J., The BioGeomancer Working Group, 2006. BioGeomancer: Automated Georeferencing to Map the World's Biodiversity Data. PLoS Biol. 4, e381.
Iliffe, J.C., Lott, R., 2008. Datums and Map Projections for Remote Sensing, GIS and Surveying. University College, London.
Kerr, J.T., Ostrovsky, M., 2003. From space to species: ecological applications for remote sensing. Trends Ecol. Evol. 18, 299–305.
Moudrý, V., Šímová, P., 2012. Influence of positional accuracy, sample size and scale on modelling species distributions: a review. Int. J. Geogr. Inf. Sci. 26, 2083–2095.
NIMA, 1997. Department of Defense World Geodetic System 1984: Its Definition and Relationships with Local Geodetic Systems. Technical Report 8350. In: Defense, D.o. (Ed.), National Imagery and Mapping Agency, Bethesda, Maryland, p. 175.
Osborne, P.E., Leitão, P.J., 2009. Effects of species and habitat positional errors on the performance and interpretation of species distribution models. Divers. Distrib. 15, 671–681.
Rondinini, C., Wilson, K.A., Boitani, L., Grantham, H., Possingham, H.P., 2006. Tradeoffs of different types of species occurrence data for use in systematic conservation planning. Ecol. Lett. 9, 1136–1145.
Snyder, J.P., 1997. Flattening the Earth: Two Thousand Years of Map Projections. University of Chicago Press.
Vaughan, I.P., Ormerod, S.J., 2003. Improving the quality of distribution models for conservation by addressing shortcomings in the field collection of training data. Conserv. Biol. 17, 1601–1611.
Wing, M.G., Eklund, A., Kellogg, L.D., 2005. Consumer-grade global positioning system (GPS) accuracy and reliability. J. For. 103, 169–173.
