Quality and usability challenges of global marine biodiversity databases: An example for marine mammal data


Vítězslav Moudrý, Rodolphe Devillers

PII: S1574-9541(20)30001-7

DOI: https://doi.org/10.1016/j.ecoinf.2020.101051

Reference: ECOINF 101051

To appear in: Ecological Informatics

Received date: 19 August 2019

Revised date: 15 December 2019

Accepted date: 31 December 2019

Please cite this article as: V. Moudrý and R. Devillers, Quality and usability challenges of global marine biodiversity databases: An example for marine mammal data, Ecological Informatics (2020), https://doi.org/10.1016/j.ecoinf.2020.101051

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier.

Vítězslav Moudrý 1,*, Rodolphe Devillers 2

1 Department of Applied Geoinformatics and Spatial Planning, Faculty of Environmental Sciences, Czech University of Life Sciences Prague, Kamýcká 129, Praha - Suchdol, 165 00, Czechia. E-mail: [email protected]

2 Department of Geography, Memorial University of Newfoundland, 300 Prince Philip Drive, St John's, NL, A1B 3X9, Canada. E-mail: [email protected]

* Corresponding author: Vítězslav Moudrý

Abstract

Knowing spatial and temporal patterns of species distribution is paramount to support marine species persistence. While datasets provided by global aggregators are increasingly rich and useful, they suffer from various types of data quality issues that can impact their usage. Using marine mammals as an example, we assessed the quality and information gaps in species distribution data from three major databases: the Global Biodiversity Information Facility (GBIF), the Ocean Biogeographic Information System (OBIS) and the International Union for Conservation of Nature (IUCN) range maps. We analysed marine mammal records from 2015 (n=1,396,581) and from 2019 (n=1,904,968) for six types of common quality or usability issues. Results for both OBIS and GBIF indicate that 35 to 55% (depending on the respective database and year) of each database’s records are potential duplicates, fall on land, or lack a data collection date. The positional accuracy of data records varies greatly due to varying precision and rounding of geographic coordinates. However, coordinate precision is specified only in 45% and 70% of records in GBIF and OBIS, respectively. In 2019, only approximately 70% of GBIF and OBIS records are encoded using more than three decimals (i.e. the remaining records have a positional accuracy no better than about 100 m). We also quantified that only 19% (n=135,885) and 11% (n=133,882) of the records in 2015 and 2019, respectively, were common to OBIS and GBIF. Despite the continuous increase in the number of records in both databases, the number of shared records slightly decreased. It is therefore likely that new records added to GBIF and OBIS between 2015 and 2019 come from different data providers. Finally, to identify potential information gaps in marine mammal distributions, we overlaid IUCN range maps and species occurrences from global databases. We found that areas previously identified as hotspots for marine mammal diversity show some of the highest rates of potential false positives (i.e. species are thought to occur there based on their range map, but no species record exists in either GBIF or OBIS). While global biodiversity databases are key to assessing global species distribution patterns, our study points to challenges that can limit data usability in biodiversity research. Improving existing data entry mechanisms, quality control routines as well as data exchange between aggregators should help make those databases more useful to the community and reduce the risks of misuse of biological data.

Keywords: Error, GBIF, OBIS, Range maps, Scale

1. Introduction

At a time of rapid changes in global marine biodiversity patterns and increasing anthropogenic pressures (Halpern et al., 2008, 2015), understanding spatial and temporal patterns of species distribution at various scales is paramount to support marine species persistence (Corrigan et al., 2014). Such a need led to the development of global databases, such as the Global Biodiversity Information Facility (GBIF), the Ocean Biogeographic Information System (OBIS) and the International Union for Conservation of Nature (IUCN) range maps. While datasets provided by global aggregators are increasingly rich and useful, they were shown to suffer from various types of data quality issues, such as duplicate records (Mesibov, 2018), records with high positional uncertainty (Otegui et al., 2013; Maldonado et al., 2015), heterogeneity amongst taxa (i.e. our knowledge of species distribution remains poor for most taxa), and spatial bias (i.e. uneven distribution of biodiversity information) (Webb et al., 2010; Jetz et al., 2012; Amano and Sutherland, 2013; Meyer et al., 2015, 2016; Amano et al., 2016; Menegotto and Rangel, 2018). Those issues keep challenging the usability of those datasets, which can in turn impact the accuracy of species distribution models (SDMs; Beck et al., 2014; Gábor et al. 2019a) or global species richness characterization (Turak et al., 2017; Menegotto and Rangel, 2018; Peterson and Soberón, 2018). To support international collaboration related to biodiversity data quality, the Biodiversity Information Standards group was established (https://www.tdwg.org/) to develop, adapt and promote best practices throughout the community.

Species occurrence data from global databases are now routinely used in SDMs in order to predict probabilities of species occurrences at unsampled sites (such as AquaMaps; Kesner-Reyes et al., 2016). It has, however, been suggested that high-quality species occurrence records (i.e. unbiased and positionally accurate data) are essential to generate informative and accurate SDMs (Duputié et al., 2014). While some recent studies explored the effect of spatial bias in global databases on SDMs (Beck et al., 2014; Pender et al., 2019), the positional and temporal accuracy of occurrences remains overlooked (but see Maldonado et al., 2015). It is well known that SDMs are affected by positional error and it has been recommended that positional errors should be minimised through careful field design and data processing (Osborne and Leitao, 2009; Gábor et al., 2019b). Moreover, it has been shown that models affected by positional error are difficult to detect (Osborne and Leitao, 2009; Mitchell et al., 2017). In addition, it is increasingly recognized in SDM that matching the temporal resolution of the ecological processes under study, environmental variables, and species occurrences is as important as spatial resolution (Fernandez et al., 2017; Mannocci et al., 2017).

Global databases have been a subject of tremendous development in recent years, leading to increases in the number of records and better data quality reporting. On the other hand, some records may have been lost, for example due to the move of global databases to Creative Commons licences. Exploring the potential challenges to the usability of global biodiversity databases can help identify aspects that will need to be improved in the coming decade to make those databases more usable in different contexts.

In this study, we present a first assessment of the quality of marine mammal distribution data in 2015 and 2019 for two major global databases: GBIF and OBIS. We selected marine mammals because they (i) have relatively good data compared to other marine species, (ii) are charismatic/iconic species important for cultural ecosystem services, and (iii) are highly endangered and hence a focus for global conservation efforts. In particular, we (1) quantified some quality elements of data aggregated in GBIF and OBIS; (2) assessed the number of shared records between GBIF and OBIS; and (3) compared the 2015 and 2019 situation. We concentrate on quality issues that are particularly relevant to data fitness for use in marine biodiversity research (e.g. assessment of species richness, species distribution modelling).

1.1. Species richness estimation

Knowledge of species’ distributions largely comes from field observations that are often made available to science as geographic data of different types: point records (e.g., observations), polygons (e.g., range maps), and/or gridded data. While species distribution atlases have been compiled at various spatial resolutions (e.g., Gibbons et al., 2007), both point records and range maps are for practical reasons often gridded using varying cell sizes. This process can result in two typical types of completeness error: errors of omission (i.e. false negatives), where species are shown to be absent from a location while they are in fact present, and errors of commission (i.e. false positives), where species are shown to be present at a location while they are in fact absent. The geographic distribution of species point records is also often biased, due to uneven sampling and sharing of those data (Isaac and Pocock, 2015; Meyer et al., 2015). Converting point data to a grid can hence lead to errors of omission if the grid resolution is too fine.

Range maps are also often used to assess global and regional species richness, to identify priority areas for conservation or identify conservation gaps, and to assess vulnerability of species to anthropogenic threats (e.g. Parravicini et al., 2014; Coll et al., 2015; Peters et al., 2015). Despite known challenges caused by arbitrary selections of scales (Jelinski and Wu, 1996), many ecological studies still ignore most range map limitations, or even dispute them (e.g. Jenkins et al., 2013). For instance, local presences derived from range maps are known to overestimate the true occupancy of species, especially at fine spatial grains, an issue that was shown to generate errors of commission (Rondinini et al., 2006; Powney and Isaac, 2015). Questions about which data to use to accurately estimate species richness have received fair attention in terrestrial ecology (Graham and Hijmans, 2006; Rocchini et al., 2011; García-Roselló et al., 2015), but far less in marine ecological research (Williams et al., 2014), an environment harder to sample. To assess information gaps in marine mammal distribution data, we compared global mammal richness based on IUCN range maps with point samples available from global databases and assessed the number of potential false presences from the perspective of range maps.

2. Material and methods

2.1. Data access and integration

OBIS (http://www.iobis.org) is an information system that mobilizes, integrates, quality controls, and publishes biogeography data about marine life (Costello et al., 2007; Fujioka et al., 2012; Vandepitte et al., 2015; De Pooter et al., 2017). As of December 2019, OBIS makes available through its portal over 57 million records. It is hence the largest primary provider of marine biogeographical information, but also one of the main providers of data to GBIF (http://www.gbif.org). GBIF is an international organisation dedicated to providing free and easy-to-access biodiversity data (Edwards et al., 2000). As of December 2019, GBIF makes available over 1 billion georeferenced terrestrial and marine species records. In addition, we used maps of marine mammal species ranges that were originally published by the Global Mammals Assessment in 2008 and made available by the IUCN Red List of Threatened Species (http://www.iucnredlist.org/). We used version 2014.1 of the marine mammal range maps, downloaded as vector polygons from the IUCN Red List website (IUCN, 2014). Range maps display the limits of a species distribution. In other words, a species is likely to occur within the polygon, but not necessarily everywhere in it, nor does it have to be equally distributed within that polygon. The range maps are based on known occurrences of the species (i.e., mainly on species sightings and captures found in existing databases) and convex-hull techniques, on habitat requirements, elevation (depth) limits, distance to coastline, climate restrictions and other expert knowledge of the species and its range. Note, however, that knowledge about species-preferred habitat and habitat data itself are lacking in the marine environment (e.g. character of the seabed and associated habitats). Furthermore, for delimiting range maps, experts can use other crowdsourced data (Wood et al. 2015) or data that are not freely available to everybody (e.g. species landings for particular countries). However, with the current methods used to produce range maps, it is difficult to trace back the data used in their construction (e.g. Boitani et al. 2011).

Data for all marine mammal species were downloaded individually from GBIF (August 2015 and October 2019) and OBIS (July 2015 and October 2019) portals. We first selected 120 out of 128 living marine mammals for our study. From the search, we excluded: the recently described species narrow-ridged finless porpoise (Neophocaena asiaeorientalis; Jefferson and Wang, 2011), Australian humpback dolphin (Sousa sahulensis; Jefferson and Rosenbaum, 2014) and Deraniyagala's beaked whale (Mesoplodon hotaula; Dalebout et al., 2014); the possibly extinct Chinese river dolphin (Lipotes vexillifer); marine otter (Lontra felina) for which occurrences were not available; Indian Ocean humpback dolphin (Sousa plumbea) for which a range map was not available; Caspian seal (Pusa caspica) due to its limited geographic range; and polar bear (Ursus maritimus) due to its specific and limited dominant habitat (i.e., sea ice). Minor differences in OBIS, GBIF and IUCN range map taxonomies for selected species were addressed as follows. Data on Juan Fernandez fur seal (Arctocephalus philippii) and Guadalupe fur seal (Arctocephalus townsendi) were merged, the latter being now considered a subspecies of the former. Data on South American fur seal (Arctocephalus australis) and New Zealand fur seal (Arctocephalus forsteri) were merged, the latter being now considered a subspecies of the former. Data on short-beaked common dolphin (Delphinus delphis) and long-beaked common dolphin (Delphinus capensis) were merged, the latter being now considered a subspecies of the former. We customized our search for download from GBIF using available filters, excluding records describing fossilized specimens and only including records with no known coordinate issues. Only occurrences with accepted names were used (see Supplementary material S1 and S2 for the list of species containing DOI and the datasets included, respectively).

2.2. Data quality assessment and data sharing between global aggregators

We assessed six types of quality or usability issues for both GBIF and OBIS: (1) data records with geographical coordinates of 0°N and 0°E, commonly found in databases and resulting from errors in the recording of geographic coordinates; (2) data records with identical values for the longitude and latitude (e.g. 13.256°N, 13.256°E); and (3) multiple entries of a species observation at the same geographic location and time within the same database (i.e., duplicate records within the same database). We assumed records to be duplicates when both spatial (latitude, longitude) and temporal (day, month, year) attributes of the records were identical. While this approach for defining duplicates is commonly used in the literature (e.g. Gaiji et al., 2013; Mesibov, 2013), it can be too simple, so we refer to those records as potential duplicates; (4) we used rounding of record coordinates (i.e. the level of precision at which the geographic location is recorded in the database, such as 12.2634°N vs 12.26°N) as an estimate of their positional accuracy. When the number of decimals differed between the coordinates (i.e. latitude and longitude), we always considered the coordinate with the greater number of decimals. In addition, for year 2019, we also assessed the geographic precision of records reported in global databases (i.e. ‘coordinateUncertaintyInMeters’ and ‘coordinatePrecision’ attributes in GBIF and OBIS, respectively; these attributes were not available in 2015); (5) locations that fall outside the species’ known habitat (i.e. in this case, marine species records overlapping terrestrial habitats). To distinguish between marine and terrestrial environments, we used the GSHHG (Global Self-consistent, Hierarchical, High-resolution Geography database) shoreline dataset (version 2.3.4, 1 January 2015) (Wessel and Smith, 1996). While such an issue is usually easy to handle for terrestrial species by simply deleting all locations falling beyond the shoreline for further analyses (e.g. García-Roselló et al., 2014), the situation for marine species can be more complex. First, some marine mammals spend part of their life in terrestrial environments (e.g., Arctocephalus spp.) or freshwater habitats (e.g., Sotalia spp.). Second, marine mammals can be observed from the shore and observers’ locations are sometimes used to locate the records (e.g., Mesoplodon spp.). We therefore distinguished species that strictly occupy the marine environment from species that also use freshwater or terrestrial habitats. We did not, however, account for the rounding of the record coordinates, which may lead to slight overestimations of habitat mismatch. Finally, (6) we measured the level of incompleteness of the date of collection attribute.
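Several of the checks listed above can be illustrated with a short sketch. The Python code below is not the authors' implementation (their scripts are in R and ArcGIS; see Supplementary material S3); it is a minimal example under stated assumptions. The field names `species`, `lat`, `lon` and `date` are hypothetical, coordinates are kept as strings so that the number of decimals is preserved, and issue 5 (records on land) is omitted because it requires a shoreline polygon overlay. How duplicates with a missing date were treated is not stated in the text; here they are compared like any other value.

```python
from collections import Counter


def decimal_places(text):
    """Count the digits after the decimal point in a coordinate string."""
    return len(text.split(".")[1]) if "." in text else 0


def flag_records(records):
    """Return one dict of quality flags per record.

    Each record is a dict of strings with the hypothetical keys
    'species', 'lat', 'lon' and 'date' ('' when the date is missing).
    """
    keys = [(r["species"], r["lat"], r["lon"], r["date"]) for r in records]
    counts = Counter(keys)
    flags = []
    for record, key in zip(records, keys):
        lat, lon = float(record["lat"]), float(record["lon"])
        flags.append({
            # Issue 1: coordinates of exactly 0, 0.
            "zero_zero": lat == 0.0 and lon == 0.0,
            # Issue 2: identical latitude and longitude (0, 0 is reported separately).
            "identical_lat_lon": lat == lon and lat != 0.0,
            # Issue 3: same species, location and date seen more than once.
            # Every member of such a group is flagged, including the first.
            "potential_duplicate": counts[key] > 1,
            # Issue 4: the coordinate with the greater number of decimals.
            "decimals": max(decimal_places(record["lat"]),
                            decimal_places(record["lon"])),
            # Issue 6: missing date of collection.
            "missing_date": record["date"].strip() == "",
        })
    return flags
```

Keeping the coordinates as text matters for issue 4: once a value such as 12.2600 is parsed as a floating-point number, the trailing zeros, and hence the recorded precision, are lost.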

Additionally, we quantified the number of shared records between GBIF and OBIS in order to assess the complementarity of those two databases. Identification of common records was based only on their spatial location, because current practices typically do not preserve linkages between records of different databases (see Guralnick et al., 2015 for a discussion on the important topic of globally unique identifiers). We considered records to be identical when they shared the exact same location (i.e. same latitude and longitude coordinates).

Some of those quality issues are not necessarily created by data aggregators, as data might have been submitted in this form by data providers. They however point to challenges that could have been prevented at the integration stage (i.e. issues 1, 2 and 3) or should be addressed at the user level (i.e. issues 4 and 5), as they can impact the usability of those data and hence the quality of the analyses based on those datasets. We selected those specific quality issues because some of them have been found to seriously influence species richness estimates (Maldonado et al., 2015) and occurrences suffering from those types of issues are typically assessed and removed from biogeographical and macroecological studies (e.g. García-Roselló et al., 2015; Gueta and Carmel, 2016; Watcharamongkol et al., 2018). While such data cleaning can be easily done in any GIS software, specialised tools have been developed, for example CoordinateCleaner (Zizka et al. 2019) or the Biogeo package (Robertson et al. 2016). The list of issues assessed is not comprehensive and other issues that cannot be detected through semi-automated data screening as performed here are common in databases (see Chapman, 2005; Meyer et al. 2016). Analyses were performed in the statistical software R, version 3.6.0, and using ArcGIS 10.5.1 (see scripts and models in Supplementary material S3).
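The location-based matching of records between two databases can be sketched as a set intersection on coordinate pairs. The Python example below is illustrative only and rests on two assumptions that are not taken from the study: the share is expressed relative to the union of unique locations in both databases (the denominator used by the authors is not spelled out here), and an optional rounding step is used as a crude stand-in for a distance buffer, which is a different technique from a true buffer in metres. The function names are hypothetical.

```python
def location_keys(points, decimals=None):
    """Build the set of unique (lat, lon) keys from (lat, lon) tuples.

    With decimals=None the coordinates must match exactly; otherwise both
    coordinates are rounded first, which tolerates small rounding differences.
    """
    if decimals is None:
        return set(points)
    return {(round(lat, decimals), round(lon, decimals)) for lat, lon in points}


def shared_fraction(db_a, db_b, decimals=None):
    """Fraction of unique locations (union of both databases) found in both."""
    a = location_keys(db_a, decimals)
    b = location_keys(db_b, decimals)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For example, a record stored as (10.12341, 20.5) in one database and (10.1234, 20.5) in the other is not shared under exact matching, but becomes shared once both are rounded to four decimals.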


2.3. Comparing species richness

To identify potential information gaps in marine mammal distributions, we overlaid IUCN range maps and species occurrences from global databases with two nested grids (at resolutions of 1° and 5°) and summed the number of species occurring in each grid cell (i.e., every species for which at least part of the range map overlapped with the cell or every species for which a record in global databases exists, respectively). To assess the correspondence between species richness estimated using range maps and known species occurrences, we used scatterplots with a 1:1 line and calculated the number of cells with false positives (i.e. commission error). In these cells, species are assumed to occur based on their range map while no record of the species exists in either of the two global databases analysed. For this analysis, we removed all duplicates and used records from both databases that were made after 1950, excluding living specimens (i.e. records from zoos and aqua parks). In addition, to avoid mismatch between occurrences and grid cell location due to the rounding or locational shifting of occurrences, we removed records with coordinates specified with fewer than one decimal and with coordinate precision worse than 50 km.
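The per-cell count of potential false positives can be sketched as a set difference between the cells a species' range map covers and the cells in which it has at least one record. The Python code below is a simplified illustration, not the authors' ArcGIS workflow. It assumes that the range-map overlay has already been done elsewhere and is supplied as a hypothetical mapping `range_cells` from species name to a set of cell indices; the cell indexing scheme and function names are likewise assumptions.

```python
import math


def cell_of(lat, lon, size_deg):
    """Index (row, col) of the grid cell containing a point.

    Rows count from 90°S and columns from 180°W. Points lying exactly on
    90°N or 180°E fall in an extra edge row or column.
    """
    return (math.floor((lat + 90.0) / size_deg),
            math.floor((lon + 180.0) / size_deg))


def occurrence_cells(records, size_deg):
    """Map each species to the set of cells holding at least one record."""
    cells = {}
    for species, lat, lon in records:
        cells.setdefault(species, set()).add(cell_of(lat, lon, size_deg))
    return cells


def false_positives_per_cell(range_cells, records, size_deg):
    """Count, per cell, the species expected from range maps but never recorded."""
    observed = occurrence_cells(records, size_deg)
    counts = {}
    for species, cells in range_cells.items():
        for cell in cells - observed.get(species, set()):
            counts[cell] = counts.get(cell, 0) + 1
    return counts
```

Running the same function with `size_deg` set to 1 and 5 reproduces the idea of the two nested grids: finer cells leave more range-map cells without any record, so the count of potential false positives rises.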

3. Results

In 2015, the GBIF dataset contained 644,748 records of marine mammals (ranging from 2 to 89,660 records per species, with an average of 5,373 records per species) and the OBIS dataset contained 751,833 records of marine mammals (ranging from 0 to 233,826 records per species, with an average of 6,835 records per species). The number of records increased considerably between 2015 and 2019. In 2019, the GBIF dataset contained 766,834 records of marine mammals (ranging from 4 to 105,162 records per species, with an average of 6,390 records per species) and the OBIS dataset contained 1,138,134 records of marine mammals (ranging from 0 to 234,717 records per species, with an average of 9,984 records per species). See Supplementary material S4 for the number of records of individual species for each database and year. Analyses revealed four issues related to the quality of the OBIS and GBIF data or their aggregation that may affect the fitness of the data for a given use: (1) data with missing collection dates, (2) records that fall outside the species habitat, (3) duplicate records within a single dataset, and (4) coordinate rounding.


3.1. Date of collection, species habitat, potential duplicates, rounding and shared records

In 2015, 23% of GBIF records and 1% of OBIS records did not have the date of data collection completed, information required for many types of ecological analyses (Figure 1a, b). The situation has greatly improved in 2019, with less than 2% of the records missing a data collection date in both databases. We also found that in both years, between 11 and 21% of GBIF and OBIS marine mammal records fell outside the species habitat (i.e., marine species found on dry land), highlighting potential positional accuracy problems. When strictly marine species and species that partly use terrestrial habitats were considered separately, the proportion of species records that fell outside marine habitat in 2019 was 17% for strictly marine species and 20% for “terrestrial” species in GBIF, and 11% for strictly marine species and 12% for “terrestrial” species in OBIS.

Analyses of duplicate records within each individual database (i.e., records of the same species within the same database for which latitude, longitude, and temporal attributes are all identical) indicate that nearly 40% of OBIS and GBIF records in 2015 were potential duplicate records (Figure 1a, b). In 2019, the situation regarding potential duplicate records is nearly the same in OBIS (37% of records are potential duplicates) but considerably improved in GBIF (only 19% of records are potential duplicates). After removing potential duplicates from each individual database, only approximately 45% of records in GBIF and 70% in OBIS in 2015 were encoded using more than three decimals (i.e. the remaining records have a locational accuracy no better than about 100 m, depending on the number of decimals used to encode the coordinates and on their location on the Earth; see Figure 2 and Table 1 for an explanation of the effect of decimal rounding on positional accuracy). While the situation has not changed in OBIS for the 2015-2019 period, the number of records in GBIF encoded using more than three decimals has increased to almost 75% (Figure 2b). A review of the coordinate precision of GBIF (i.e. ‘coordinateUncertaintyInMeters’ attribute) and OBIS (i.e. ‘coordinatePrecision’ attribute) records shows that those attributes are not documented for 46% (GBIF) and 30% (OBIS) of the records. When documented, the reported accuracy is finer than approximately 1 km for 14% (GBIF) and 55% (OBIS) of the records (Figure 3).


Figure 1. Marine mammal records not affected by the two quality issues (i.e. records that have a date of collection and fall inside marine habitat) and free of potential duplicates in the Global Biodiversity Information Facility (GBIF) and Ocean Biogeographic Information System (OBIS) databases. (a, b) Percentage of records in GBIF (black) and OBIS (blue) that do not suffer from the two quality issues studied and do not have potential duplicate records in the same database in 2015 and 2019, respectively. Note that approximately 45% to 66% of records in both databases are unique and do not suffer from any quality issues in the two years analysed. (c, d) Percentage of individual and shared records between GBIF and OBIS after removing duplicate records in 2015 and 2019, respectively. Note that estimates are based only on geographic coordinates and may slightly vary because of repeated records at the same location and because of rounding of the coordinates (see Supplementary material S3).


Figure 2. Geographic coordinate rounding (i.e. the number of decimal places) of marine mammal records in GBIF (a) and OBIS (b) in 2015 (black) and 2019 (blue).


Figure 3. Geographic coordinate precision of marine mammal records in GBIF (black; i.e. ‘coordinateUncertaintyInMeters’ attribute) and OBIS (blue; i.e. ‘coordinatePrecision’ attribute) in 2019. Note that such attributes were not available in 2015.

Table 1. Effects of coordinate rounding

| Number of decimal places | Accuracy at the equator [m] | What can approximately be identified | Accuracy of GPS measurements |
|---|---|---|---|
| 0 | 111 320 | ocean or sea | - |
| 1 | 11 132 | one reserve from another | - |
| 2 | 1 113.20 | one beach or island from another | - |
| 3 | 111.32 | one colony from another | - |
| 4 | 11.132 | - | Limit of consumer grade uncorrected GPS accuracy with no interference |
| 5 | 1.1132 | one individual from another | GPS with differential correction, sometimes referred to as GIS GPS |
| 6 | 0.11132 | - | - |
| ≥7 | ≤ 0.011132 | - | geodetic GPS used for surveying; near limit of what GPS-based techniques can achieve |
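The equatorial values in Table 1 follow from dividing the length of one degree at the equator (111 320 m, the value used in the table) by ten for each additional decimal place. The short Python sketch below reproduces that arithmetic and, as an added assumption not stated in the table, scales the east-west distance by the cosine of latitude using a spherical approximation of the Earth. The function and constant names are hypothetical.

```python
import math

METRES_PER_DEGREE_EQUATOR = 111_320  # length of one degree used in Table 1


def rounding_resolution_m(decimals, latitude_deg=0.0):
    """Approximate ground distance of one unit in the last decimal place.

    Returns (north_south_m, east_west_m). The east-west distance shrinks
    with the cosine of latitude under a spherical approximation.
    """
    step_deg = 10.0 ** -decimals
    north_south = METRES_PER_DEGREE_EQUATOR * step_deg
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west
```

For instance, three decimals correspond to about 111 m in both directions at the equator, while at 60° latitude the east-west component is roughly half of that, which is why the effect of rounding depends on where a record lies on the Earth.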

Finally, we found a high number of data records available only in one of the two databases (Figure 1c, d). After removing potential duplicates in each individual database, only 19% of records in 2015 and 11% of records in 2019 were common to both GBIF and OBIS (i.e., 135,885 records in 2015 and 133,882 records in 2019). Records common to both GBIF and OBIS tend to be in areas already characterized by higher data density (Figure 4). In addition, we found that in some cases, the relatively low number of shared records was caused by coordinate rounding in one of the databases. As slight differences in latitude and longitude caused by rounding may occur, we tested several buffers up to a distance of 100 m. When using a 10 m buffer (which roughly corresponds to rounding to four decimals), the proportion of shared records increased to 29% in 2015 and 18% in 2019, numbers that did not change when using larger buffer sizes. Most individual species show only a small increase in shared records when the buffers were applied, but, for example, Antarctic fur seal (Arctocephalus gazella) and Weddell seal (Leptonychotes weddellii) show a considerable increase from a few to thousands of shared records (see Supplementary material S5 for the number of shared records of individual species).

Figure 4. Density of data records in OBIS, GBIF, and shared records at 5° grid-cells on a scale from low (light blue) to high (dark blue) in 2015 and 2019, respectively. Data records exclude potential duplicates. We used records located within a 10 m buffer to account for coordinates rounding. Maps used a Cylindrical Equal Area Projection (WKID: 54034).

3.2. Global distribution of marine mammals

Our results show strong underestimation of marine mammal richness in global databases compared to richness based on expert-based species range maps (Figure 5). Even at a 5° resolution, richness estimates based on range maps are in most cases higher than those based on the global databases (Figure 5a). The nine areas (Figure 5b) of high marine mammal richness reported by Pompa et al. (2011) showed some of the highest rates of potential false positives (i.e. number of species that are thought to occur in each cell based on their range map, but for which no species record exists in GBIF and OBIS; Figure 5c).


Figure 5. (a) Relationship between species richness per pixel based on IUCN range maps vs point occurrences in GBIF and OBIS databases at 5° and 1° in 2019. The solid line indicates y = x. (b) Marine mammal richness (i.e. number of species per 5x5° cell). Names refer to nine areas identified as marine mammal diversity hotspots by Pompa et al. (2011). (c) Potential false positives, i.e. the number of species that are thought to occur in each cell based on their range map but for which no species record exists in either GBIF or OBIS (in 2015, the distribution of potential false positives was almost the same). The number of potential false presences in each cell is shown in the legend on the left. Note the high rates of potential false positives in areas identified as marine mammal diversity hotspots. Maps used a Cylindrical Equal Area Projection (WKID: 54034).

4. Discussion

4.1. Quality of global biodiversity databases

The two largest public marine biodiversity databases, GBIF and OBIS, contain very large amounts of biodiversity records. Efforts to improve the quality of those databases are clear, as our analyses have not found some types of errors reported by previous studies (e.g. occurrences with geographical coordinates of 0° longitude and 0° latitude; Otegui et al., 2013). In addition, some potential issues have been improved upon between 2015 and 2019 (e.g. fewer duplicate records, fewer records without a completed date of collection). However, our study points to a number of additional data quality and usability challenges that should be considered to ensure that data are fit for a given use in ecological research and to prevent misuse of the data. While the analyses presented in this paper were only conducted for marine mammals, similar issues are likely to be found for other species. However, marine mammals are typically observed from a distance and hence their observations are more prone to spatial error in comparison to other marine species (e.g. those caught in nets). It is therefore difficult to generalize our findings to other species and further studies will be required to quantify quality issues accurately for different marine groups.

4.1.1. Records outside species habitat, coordinate rounding, and absence of date of collection

Records that fall outside the species habitat can be real observations of marine mammals that spend part of their life in terrestrial environments, but can also be caused by errors in the geographic coordinates (e.g. observers’ locations are sometimes used to locate species’ records), low precision of the coordinates (e.g. rounding to the nearest degree or minute) or even taxonomic errors (Robertson, 2008). The uncertainty in location caused by the rounding of coordinates should always be considered in the data cleaning process (see for example Watcharamongkol et al., 2018, who removed data with a precision of fewer than three decimal places). Most of the records in both GBIF and OBIS are reported to four or more decimals (Figure 2) and are thus suitable for regional studies at relatively fine spatial resolutions. Local studies (e.g. of sedentary species) may benefit from such spatially accurate data, assuming that the data really have such accuracy. However, this is difficult to assess as coordinate precision attributes are often empty in both databases (Figure 3). In addition, we found evidence that the number of decimals differs for the same observation between the two databases (see supplementary material S5). In contrast, such accuracy is neither necessary nor reasonable for large, highly mobile marine mammals. We doubt that any of the marine mammal records was collected with an accuracy better than one meter and discourage reporting coordinates with seven or more decimals (see also Mesibov, 2013 for a discussion of rounding error in global databases). One should also note that while the GSHHG dataset is a coastline dataset of high spatial accuracy, it is still a simplified representation of the coast; therefore false negatives could exist for species occurrences recorded very close to the coast.
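
To make the link between the number of decimals and positional uncertainty concrete, the following Python sketch counts the decimals of a coordinate and converts them into an approximate rounding uncertainty. It is a minimal illustration under stated assumptions: coordinates are read as text (so that trailing zeros are preserved), the Earth is treated as a sphere with about 111.32 km per degree of latitude, and the function names are hypothetical rather than part of the scripts used in this study.

```python
import math


def decimal_places(coordinate_text):
    """Return the number of digits after the decimal point in a coordinate string."""
    text = coordinate_text.strip()
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def rounding_uncertainty_m(decimals, latitude_deg=0.0):
    """Approximate half-width (in metres) of the rounding interval implied by `decimals`.

    Assumes a spherical Earth where one degree of latitude spans about 111.32 km.
    Returns a (north-south, east-west) tuple; the east-west value shrinks with latitude.
    """
    metres_per_degree = 111_320.0
    half_step_deg = 0.5 * 10 ** (-decimals)
    north_south = half_step_deg * metres_per_degree
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Hypothetical coordinate strings, for illustration only
for text in ["48", "48.1", "48.1234", "48.1234567"]:
    d = decimal_places(text)
    ns, ew = rounding_uncertainty_m(d, latitude_deg=48.0)
    print(f"{text}: {d} decimals, about ±{ns:.3f} m north-south, ±{ew:.3f} m east-west")
```

Note that the number of decimals only bounds the rounding error; it says nothing about the accuracy with which the position was originally measured, which is why populated coordinate precision or uncertainty attributes remain necessary.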

We found less than 2% of records without a date of collection in OBIS in both years. In addition, we found a considerable improvement in the recording of the date of collection in GBIF between 2015 and 2019. This considerably improves the usefulness of the data for studying temporal patterns, such as species migrations and biogeographic changes. Furthermore, the marine environment is highly dynamic: species disperse and migrate over large distances and interact with dynamic oceanographic processes that vary at time scales from seconds to decades (Mannocci et al., 2017). These processes may vary yearly, seasonally, monthly or even weekly, and it is therefore important to have dates of collection recorded with the best temporal resolution possible (e.g. day of observation; Fernandez et al., 2017; Mannocci et al., 2017).

4.1.2. Potential duplicates

While not a quality issue related to individual data records, duplicate records can greatly impact database usability, as they increase the risk of the data being misused if they are not carefully pre-processed before an analysis (e.g. Hijmans and Elith, 2018). We found a high number of potential duplicate records. Nearly 37% and 20% of OBIS and GBIF records, respectively, were potential duplicates in 2019 (note that in 2015, this was almost 40% in both databases). We considered records to be potential duplicates when both the spatial (latitude, longitude) and temporal (day, month, year) attributes of the records were identical. We assume that it is highly unlikely that one species was recorded at exactly the same coordinates two or more times on the same day. Therefore, it is likely that duplicate records result from the same dataset or record being published more than once. It therefore remains an open question how many unique records are really available in global databases. However, other causes can exist, such as deliberate entries of distinct observations made at the same location and time, or the concurrent recording of several individuals of the same species (Mesibov, 2013; Nelson et al., 2018). For comparison, Gaiji et al. (2013) reported approximately 10% of potential duplicate records in GBIF across all species. Unfortunately, identifying the cause of a duplicate record is difficult and can be complicated by the incompleteness of temporal attributes (i.e. date of collection left as null). Two types of quality issues may occur when duplicates are identified: (1) values for temporal attributes were not completed and thus two or more records from a single location are considered potential duplicates even when they are not, and (2) one record has a temporal attribute while another one does not. Although the latter are also true potential duplicates, they may not be identified as such. Potential duplicate records influence some metrics used to evaluate the completeness of global databases, such as the number of records per unit area, values used to identify data gaps (e.g. Troia and McManamay, 2016), and ‘range fit’ calculations (i.e. the proportion of presence records that can be found within a species range polygon; see Ficetola et al., 2014).
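
The duplicate criterion described above (identical coordinates and identical day, month and year) can be sketched in a few lines of Python. This is a hedged illustration rather than the procedure used in this study: the record structure, field names and function name are hypothetical, grouping by species is our assumption, and records with an incomplete date are set aside instead of being compared, which reflects the two issues noted above.

```python
from collections import defaultdict


def flag_potential_duplicates(records):
    """Group records sharing species, coordinates and a complete date.

    Returns (duplicate_groups, undated): duplicate_groups is a list of lists of
    record indices with identical keys; undated lists the indices of records whose
    date is incomplete and therefore cannot be compared reliably.
    """
    groups = defaultdict(list)
    undated = []
    for index, rec in enumerate(records):
        date = (rec.get("year"), rec.get("month"), rec.get("day"))
        if any(part is None for part in date):
            undated.append(index)
            continue
        key = (rec["species"], rec["lat"], rec["lon"]) + date
        groups[key].append(index)
    duplicate_groups = [ids for ids in groups.values() if len(ids) > 1]
    return duplicate_groups, undated


# Toy records (hypothetical values, for illustration only)
records = [
    {"species": "Orcinus orca", "lat": 48.5, "lon": -123.1, "year": 2010, "month": 7, "day": 1},
    {"species": "Orcinus orca", "lat": 48.5, "lon": -123.1, "year": 2010, "month": 7, "day": 1},
    {"species": "Orcinus orca", "lat": 48.5, "lon": -123.1, "year": 2010, "month": 7, "day": 2},
    {"species": "Orcinus orca", "lat": 48.5, "lon": -123.1, "year": None, "month": None, "day": None},
]

duplicate_groups, undated = flag_potential_duplicates(records)
print(duplicate_groups, undated)
```

Reporting undated records separately avoids both failure modes: they are neither silently merged with each other nor silently excluded from the duplicate count.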

4.1.3. Data sharing between the aggregators

Researchers often have to spend considerable effort gathering species occurrence data from various sources and in different formats (Franklin et al., 2017; Saeedi et al., 2017). While the ultimate goal of data aggregation is to provide users with the ability to download data from a single location, our results show that there is a large difference in the data available through OBIS and GBIF. Approximately 135,000 records were shared between OBIS and GBIF in both years. When we considered possible differences in coordinate rounding, this number increased to 190,000 and 220,000 in 2015 and 2019, respectively. The relatively small difference in the number of shared records between 2015 and 2019 suggests that the increase in the number of database records (more than 600,000 and 100,000 records added to OBIS and GBIF, respectively) comes from different data providers. Despite the many benefits brought by databases like GBIF and OBIS, our results highlight the importance of ongoing efforts to seek new contributors (e.g. regional databases), to incorporate new data sources (e.g. open-access repositories) and especially the need for increased cooperation between GBIF and OBIS (signed in October 2014) to avoid duplication of efforts (IODE Steering Group for OBIS, 2015; Sikes et al., 2016; Bingham et al., 2017). For example, it could be recommended that marine data should systematically be submitted to OBIS before they are added to GBIF. To improve the usability of the data, data aggregators should not simply publish data submitted by contributors, but should also ensure that data meet higher quality criteria, are aggregated more carefully using quality control routines and are better shared with other aggregators.
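
The effect of coordinate rounding on the number of shared records can be illustrated with a small Python sketch. It is only one possible way of matching records and is not claimed to be the method used in this study: records are compared on species, coordinates and date, and an optional number of decimals (our assumption) is applied to both sources before comparison. All names and values are hypothetical.

```python
def record_key(rec, decimals=None):
    """Build a comparison key; optionally round coordinates to `decimals` places."""
    lat, lon = rec["lat"], rec["lon"]
    if decimals is not None:
        lat, lon = round(lat, decimals), round(lon, decimals)
    return (rec["species"], lat, lon, rec["date"])


def count_shared(source_a, source_b, decimals=None):
    """Count distinct keys present in both sources."""
    keys_a = {record_key(r, decimals) for r in source_a}
    keys_b = {record_key(r, decimals) for r in source_b}
    return len(keys_a & keys_b)


# Toy extracts (hypothetical values, for illustration only)
obis = [
    {"species": "Orcinus orca", "lat": 48.12341, "lon": -123.45672, "date": "2010-07-01"},
    {"species": "Phoca vitulina", "lat": 54.5, "lon": 8.25, "date": "2012-03-15"},
    {"species": "Tursiops truncatus", "lat": 36.1, "lon": -5.4, "date": "2015-09-09"},
]
gbif = [
    {"species": "Orcinus orca", "lat": 48.1234, "lon": -123.4567, "date": "2010-07-01"},
    {"species": "Phoca vitulina", "lat": 54.5, "lon": 8.25, "date": "2012-03-15"},
]

shared_exact = count_shared(obis, gbif)
shared_rounded = count_shared(obis, gbif, decimals=4)
print(shared_exact, shared_rounded)
```

In this toy case the first record is shared only once both sources are brought to the same number of decimals, which mirrors the increase in shared records reported above when rounding differences are taken into account.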

4.2. Global patterns of marine mammal diversity

Almost all available data in both databases were used for the assessment of global patterns of marine mammal diversity at 1° and 5° resolutions, as only very few records were rounded to integer degrees or had a positional uncertainty higher than half of the grid resolution. The strong spatial underrepresentation of marine mammals in global databases when compared to expert-based species range maps may result from the known higher spatial survey effort in the northern hemisphere for cetaceans (Kaschner et al., 2012) and the higher participation in data-sharing networks by some countries (Meyer et al., 2015). Such bias in global databases hampers their broader use in biodiversity research (Peterson and Soberón, 2018). Although methods that help account for sampling bias exist (Chaudhary et al., 2017), further integration of existing data worldwide should be prioritized (Meyer et al., 2015, 2016).

In the absence of better species distribution data in the form of point records, range maps can help estimate marine species richness. It is increasingly common for studies to combine range maps with global database records (e.g. Asaad et al., 2018). As mentioned above, the current paucity of data also encourages the use of SDM outputs, such as AquaMaps, which provide an interesting alternative to the datasets assessed in our study. The AquaMaps range maps, based on OBIS and GBIF data, are produced at 0.5° resolution and have recently been compared to IUCN range maps by O’Hara et al. (2017). Marine studies tend to use gridded species range maps and point records at resolutions that vary from 0.1° to 5°. For example, Coll et al. (2012, 2015) used a 0.1° spatial resolution in their search for areas valuable for conservation, which at the latitude of the Mediterranean Sea is approximately 9 x 11 km. However, the resolution at which range maps can deliver an appropriate representation of marine species diversity remains an open question. To avoid high rates of commission error, it was recommended that terrestrial studies assessing diversity patterns use range maps at roughly 100 km grid resolution or coarser (Hurlbert and Jetz, 2007; Hawkins et al., 2008). No such guideline exists for the marine environment, and an appropriate resolution for the use of range maps is yet to be suggested. However, marine studies that openly acknowledge the limitations of range maps used much coarser spatial resolutions (e.g. Parravicini et al., 2014 used a 5° grid). We suggest, despite the existing differences, that lessons should be learned from existing terrestrial studies and discourage the use of fine grid resolutions (Hurlbert and Jetz, 2007; Hawkins et al., 2008).
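
The correspondence between grid resolutions in degrees and distances in kilometres can be checked with simple arithmetic. The Python sketch below is an approximation under stated assumptions: a spherical Earth with about 111.32 km per degree of latitude, and a latitude of 38°N chosen by us as representative of the Mediterranean Sea. The function name is hypothetical.

```python
import math


def cell_size_km(resolution_deg, latitude_deg):
    """Approximate (east-west, north-south) size of a grid cell in kilometres.

    Assumes a spherical Earth with about 111.32 km per degree of latitude;
    the east-west extent shrinks with the cosine of the latitude.
    """
    km_per_degree = 111.32
    north_south = resolution_deg * km_per_degree
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south


# 0.1 degree cell at an assumed Mediterranean latitude of 38 degrees north
ew, ns = cell_size_km(0.1, 38.0)
print(f"0.1 degree at 38N: about {ew:.1f} x {ns:.1f} km")

# Coarser grids at the same latitude, for comparison with the ~100 km terrestrial guideline
for resolution in (0.5, 1.0, 5.0):
    ew_r, ns_r = cell_size_km(resolution, 38.0)
    print(f"{resolution} degree at 38N: about {ew_r:.0f} x {ns_r:.0f} km")
```

Under these assumptions a 0.1° cell measures roughly 9 x 11 km, while a 1° cell is close to the roughly 100 km resolution recommended for terrestrial range maps.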

4.3. Are global database records fit for use in marine biodiversity research?

The data quality and usability problems identified in this study, if not corrected, may be acceptable for some data users but not for others, depending on the specific analyses to be conducted (Belbin et al., 2013). For instance, one of the major uses of aggregated data in marine biodiversity research is SDM. Users are increasingly concerned about the importance of spatial data quality for SDM applications (Moudrý et al., 2017; Lecours et al., 2017; Moudrý et al., 2018; Šímová et al., 2019; Araújo et al., 2019). A recent report on GBIF data fitness for use in SDM highlighted that GBIF data cannot be used in SDM without prior data cleaning (Anderson et al., 2016), and producing SDM using data solely from global aggregators presents risks (e.g. Ferro and Flick, 2015). Fitness-for-purpose assessment has been facilitated by recent progress in quality control (see Vandepitte et al., 2015 for quality control flags), and data are processed by aggregators in an effort to correct or flag data errors. However, incorrect data processing has been documented (Mesibov, 2018). While no data user should trust aggregated data blindly (see discussion by Franz and Sterner, 2018), an unnecessary risk of data misuse remains and could often be avoided. Current ad hoc data cleaning approaches result in an unnecessary duplication of efforts that could be prevented. Recently, Veiga et al. (2017) proposed a conceptual framework that allows users to document the fitness for use of data and helps data aggregators and providers improve data products, share data more responsibly and decrease the duplication of efforts. While data aggregators have traditionally seen their role as being limited to the archiving of data, putting the burden of assessing how data fit a specific use on the users, it is increasingly recognized that data quality is a shared responsibility of users, providers, and aggregators (e.g. Belbin et al., 2013; Anderson et al., 2016). Giving users the possibility to communicate quality issues to aggregators and providers can significantly improve the quality of shared data (Franz and Sterner, 2018). Improving existing data entry mechanisms, quality control routines, and data exchange between aggregators should help make those databases more useful to the community and reduce the risk of misuse of biological data.

REFERENCES

Anderson, R. P., Araújo, M., Guisan, A., Lobo, J. M., Martínez-Meyer, E., Peterson, A. T., & Soberón, J. (2016). Final report of the task group on GBIF data fitness for use in distribution modelling. Global Biodiversity Information Facility, Geneva. http://www.gbif.org/resource/82612

Amano, T., Lamming, J.D.L. & Sutherland, W.J. (2016) Spatial Gaps in Global Biodiversity Information and the Role of Citizen Science. BioScience, 66, 393–400.

Amano, T. & Sutherland, W. (2013) Four barriers to the global understanding of biodiversity conservation: wealth, language, geographical location and security. Proc. R. Soc. B, 280.

Araújo, M. B., Anderson, R. P., Barbosa, A. M., Beale, C. M., Dormann, C. F., Early, R., Garcia, R. A., Guisan, A., Maiorano, L., Naimi, B., O’Hara, R. B., Zimmermann, N. E., Rahbek, C. (2019). Standards for distribution models in biodiversity assessments. Science Advances, 5, eaat4858.

Asaad, I., Lundquist, C. J., Erdmann, M. V. & Costello, M. J. (2018) Delineating priority areas for marine biodiversity conservation in the Coral Triangle. Biological Conservation, 222, 198-211.

Beck, J., Böller, M., Erhardt, A., & Schwanghart, W. (2014) Spatial bias in the GBIF database and its effect on modeling species' geographic distributions. Ecological Informatics, 19, 10-15.

Belbin, L., Daly, J., Hirsch, T., Hobern, D. & Salle, J. La (2013) A specialist’s audit of aggregated occurrence records: An “aggregator”s’ perspective. ZooKeys, 76, 67–76.

Bingham H, Doudin M, Weatherdon L, Despot-Belmonte K, Wetzel F, Groom Q, Lewis E, Regan E, Appeltans W, Güntsch A, Mergen P, Agosti D, Penev L, Hoffmann A, Saarenmaa H, Geller G, Kim K, Kim H, Archambeau A, Häuser C, Schmeller D, Geijzendorffer I, García Camacho A, Guerra C, Robertson T, Runnel V, Valland N, Martin C (2017) The Biodiversity Informatics Landscape: Elements, Connections and Opportunities. Research Ideas and Outcomes 3: e14059. https://doi.org/10.3897/rio.3.e14059

Boitani, L., Maiorano, L., Baisero, D., Falcucci, A., Visconti, P., & Rondinini, C. (2011) What spatial data do we need to develop global mammal conservation strategies?. Philosophical Transactions of the Royal Society B: Biological Sciences, 366, 2623-2632.

Chapman, A. D. (2005) Principles of data quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen, Denmark. ISBN 87-92020-03-8.

Chaudhary, C., Saeedi, H., & Costello, M. J. (2017) Marine Species Richness Is Bimodal with Latitude: A Reply to Fernandez and Marques. Trends in Ecology & Evolution, 32, 234-237.

Coll, M., Piroddi, C., Albouy, C., Ben Rais Lasram, F., Cheung, W.W.L., Christensen, V., Karpouzi, V.S., Guilhaumon, F., Mouillot, D., Paleczny, M., Palomares, M.L., Steenbeek, J., Trujillo, P., Watson, R. & Pauly, D. (2012) The Mediterranean Sea under siege: spatial overlap between marine biodiversity, cumulative threats and marine reserves. Global Ecology and Biogeography, 21, 465–480.

Coll, M., Steenbeek, J., Ben Rais Lasram, F., Mouillot, D. & Cury, P. (2015) “Low-hanging fruit” for conservation of marine vertebrate species at risk in the Mediterranean Sea. Global Ecology and Biogeography, 24, 226–239.

Corrigan, C.M., Ardron, J. a., Comeros-Raynal, M.T., Hoyt, E., Notarbartolo Di Sciara, G. & Carpenter, K.E. (2014) Developing important marine mammal area criteria: learning from ecologically or biologically significant areas and key biodiversity areas. Aquatic Conservation: Marine and Freshwater Ecosystems, 24, 166–183.

Costello, M.J., Stocks, K., Zhang, Y., Grassle, J.F. & Fautin, D.G. (2007) About the Ocean Biogeographic Information System. (Accessed 12.02.17).

Dalebout, M.L., Scott Baker, C., Steel, D., Thompson, K., Robertson, K.M., Chivers, S.J., Perrin, W.F., Goonatilake, M., Charles Anderson, R., Mead, J.G., Potter, C.W., Thompson, L., Jupiter, D. & Yamada, T.K. (2014) Resurrection of Mesoplodon hotaula Deraniyagala 1963: A new species of beaked whale in the tropical Indo-Pacific. Marine Mammal Science, 30, 1081–1108.

De Pooter, D., Appeltans, W., Bailly, N., Bristol, S., Deneudt, K., Eliezer, M., Fujioka, E., Giorgetti, A., Goldstein, P., Lewis, M., Lipizer, M., Mackay, K., Marin, M., Moncoiffé, G., Nikolopoulou, S., Provoost, P., Rauch, S., Roubicek, A., Torres, C., Van de Putte, A., Vandepitte, L., Vanhoorne, B., Vinci, M., Wambiji, N., Watts, D., Klein Salas, E. & Hernandez, F. (2017) Toward a new data standard for combined marine biological and environmental datasets - expanding OBIS beyond species occurrences. Biodiversity Data Journal, 5, e10989.

Duputié, A., Zimmermann, N. E. & Chuine, I. (2014). Where are the wild things? Why we need better data on species distribution. Global Ecology and Biogeography, 23, 457-467.

Edwards, J. L., Lane, M. A. & Nielsen, E. S. (2000) Interoperability of biodiversity databases: biodiversity information on every desktop. Science, 289, 2312-2314.

Fernandez, M., Yesson, C., Gannier, A., Miller, P. I. & Azevedo, J. M. (2017) The importance of temporal resolution for niche modelling in dynamic marine environments. Journal of biogeography, 44, 2816-2827.

Ferro, M. L. & Flick, A. J. (2015) “Collection Bias” and the Importance of Natural History Collections in Species Habitat Modeling: A Case Study Using Thoracophorus costalis Erichson (Coleoptera: Staphylinidae: Osoriinae), with a Critique of GBIF.org. The Coleopterists Bulletin, 69, 415-425.

Ficetola, G.F., Rondinini, C., Bonardi, A., Katariya, V., Padoa-Schioppa, E. & Angulo, A. (2014) An evaluation of the robustness of global amphibian range maps. Journal of Biogeography, 41, 211–221.

Franklin, J., Serra-Diaz, J.M., Syphard, A.D. & Regan, H.M. (2017) Big data for forecasting the impacts of global change on plant communities. Global Ecology and Biogeography, 26, 6–17.

Franz, N. M. & Sterner, B. W. (2018) To increase trust, change the social design behind aggregated biodiversity data. Database, 2018, 1-12.

Fujioka, E., Berghe, E. Vanden, Donnelly, B., Castillo, J., Cleary, J., Holmes, C., McKnight, S. & Halpin, P. (2012) Advancing Global Marine Biogeography Research with Open-source GIS Software and Cloud Computing. Transactions in GIS, 16, 143–160.

Gábor, L., Moudrý, V., Barták, V., & Lecours, V. (2019a) How do species and data characteristics affect species distribution models and when to use environmental filtering? International Journal of Geographical Information Science, 1-18.

Gábor, L., Moudrý, V., Lecours, V., Malavasi, M., Barták, V., Fogl, M., Šímová, P., Rocchini, D., & Václavík, T. (2019) The effect of positional error on fine scale species distribution models increases for specialist species. Ecography.

Gaiji, S., Chavan, V., Ariño, A.H., Otegui, J., Hobern, D., Sood, R. & Robles, E. (2013) Content assessment of the primary biodiversity data published through GBIF network: status, challenges and potentials. Biodiversity informatics, 8, 94–172.

García-Roselló, E., Guisande, C., Heine, J., Pelayo-Villamil, P., Manjarrés-Hernández, A., González Vilas, L., González-Dacosta, J., Vaamonde, A. & Granado-Lorencio, C. (2014) Using modestr to download, import and clean species distribution records. Methods in Ecology and Evolution, 5, 708–713.

García-Roselló, E., Guisande, C., Manjarrés-Hernández, A., González-Dacosta, J., Heine, J., Pelayo-Villamil, P., González-Vilas, L., Vari, R.P., Vaamonde, A., Granado-Lorencio, C. & Lobo, J.M. (2015) Can we derive macroecological patterns from primary Global Biodiversity Information Facility data? Global Ecology and Biogeography, 24, 335–347.

Gibbons, D.W., Donald, P.F., Bauer, H.-G., Fornasari, L. & Dawson, I.K. (2007) Mapping avian distributions: the evolution of bird atlases. Bird Study, 54, 324–334.

Goodwin, Z. A., Harris, D. J., Filer, D., Wood, J. R. & Scotland, R. W. (2015) Widespread mistaken identity in tropical plant collections. Current Biology, 25, R1066-R1067.

Graham, C. & Hijmans, R. (2006) A comparison of methods for mapping species ranges and species richness. Global Ecology and Biogeography, 15, 578–587.

Gueta, T. & Carmel, Y. (2016) Quantifying the value of user-level data cleaning for big data: A case study using mammal distribution models. Ecological informatics, 34, 139-145.

Guralnick, R.P., Cellinese, N., Deck, J., Pyle, R.L., Kunze, J., Penev, L., Walls, R., Hagedorn, G., Agosti, D., Wieczorek, J., Catapano, T. & Page, R.D.M. (2015) Community next steps for making globally unique identifiers work for biocollections data. ZooKeys, 133–54.

Halpern, B.S., Frazier, M., Potapenko, J., Casey, K.S., Koenig, K., Longo, C., Lowndes, J.S., Rockwood, R.C., Selig, E.R., Selkoe, K. a & Walbridge, S. (2015) Spatial and temporal changes in cumulative human impacts on the world’s ocean. Nature communications, 6, 7615.

Halpern, B.S., Walbridge, S., Selkoe, K. a, Kappel, C. V, Micheli, F., D’Agrosa, C., Bruno, J.F., Casey, K.S., Ebert, C., Fox, H.E., Fujita, R., Heinemann, D., Lenihan, H.S., Madin, E.M.P., Perry, M.T., Selig, E.R., Spalding, M., Steneck, R. & Watson, R. (2008) A global map of human impact on marine ecosystems. Science (New York, N.Y.), 319, 948–52.

Hawkins, B. A., Rueda, M. & Rodríguez, M.Á. (2008) What Do Range Maps and Surveys Tell Us About Diversity Patterns? Folia Geobotanica, 43, 345–355.

Hijmans, R. J., & Elith, J. (2018). Species Distribution Modeling, http://rspatial.org/sdm/ accessed 29th of August.

Hurlbert, A. & Jetz, W. (2007) Species richness, hotspots, and the scale dependence of range maps in ecology and conservation. Proceedings of the National Academy of Sciences, 104, 13384–13389.

Isaac, N. & Pocock, M. (2015) Bias and information in biological records. Biological Journal of the Linnean Society, 115, 522–531.

IODE Steering Group for OBIS (SG-OBIS). (2015). Fourth Session, 10-12 February 2015. Reports of Meetings of Experts and Equivalent Bodies, UNESCO 2015 (English), UNESCO, 24 pp.

IUCN 2014. The IUCN Red List of Threatened Species. Version 2014.1. http://www.iucnredlist.org. Downloaded on 27.8.2015.

Jefferson, T.A. & Rosenbaum, H.C. (2014) Taxonomic revision of the humpback dolphins (Sousa spp.), and description of a new species from Australia. Marine Mammal Science, 30, 1494–1541.

Jefferson, T.A. & Wang, J.Y. (2011) Revision of the taxonomy of finless porpoises (genus Neophocaena): the existence of two species. Journal of Marine Animals and Their Ecology, 4, 3–16.

Jelinski, D. & Wu, J. (1996) The modifiable areal unit problem and implications for landscape ecology. Landscape ecology, 11, 129–140.

Jenkins, C. N., Pimm, S.L. & Joppa, L.N. (2013) Global patterns of terrestrial vertebrate diversity and conservation. Proceedings of the National Academy of Sciences, 110, E2602–E2610.

Jetz, W., McPherson, J. & Guralnick, R. (2012) Integrating biodiversity distribution knowledge: toward a global map of life. Trends in ecology & evolution, 27, 151–9.

Kaschner, K., Quick, N.J., Jewell, R., Williams, R. & Harris, C.M. (2012) Global coverage of cetacean line-transect surveys: status quo, data gaps and future challenges. PloS one, 7, e44075.

Kesner-Reyes, K., Kaschner, K., Kullander, S., Garilao, C., Barile, J. & Froese, R. (2016) AquaMaps: algorithm and data sources for aquatic organisms. In: Froese, R. and D. Pauly. Editors. 2012. FishBase. World Wide Web electronic publication. www.fishbase.org, version (04/2012).

Lecours, V., Devillers, R., Edinger, E. N., Brown, C. J., & Lucieer, V. L. (2017) Influence of artefacts in marine digital terrain models on habitat maps and species distribution models: A multiscale assessment. Remote Sensing in Ecology and Conservation, 3, 232-246.

Maldonado, C., Molina, C.I., Zizka, A., Persson, C., Taylor, C.M., Albán, J., Chilquillo, E., Rønsted, N. & Antonelli, A. (2015) Estimating species diversity and distribution in the era of Big Data: to what extent can we trust public databases? Global ecology and biogeography, 24, 973–984.

Mannocci, L., Boustany, A. M., Roberts, J. J., Palacios, D. M., Dunn, D. C., Halpin, P. N., Viehman, S., Moxley, J., Cleary, J., Bailey, H., Bograd, S. J., Becker, E. A., Gardner, B., Hartog, J. R., Hazen, E. L., Ferguson, M. C., Forney, K. A., Kinlan, B. P., Oliver, M. J., Perretti, C. T., Ridoux, V., Teo, S. L. H., Winship, A. J. & Bograd, S. J. (2017) Temporal resolutions in species distribution models of highly mobile marine animals: Recommendations for ecologists and managers. Diversity and Distributions, 23, 1098-1109.

Marchese, C. (2015) Biodiversity hotspots: A shortcut for a more complicated concept. Global Ecology and Conservation, 3, 297–309.

Menegotto, A., & Rangel, T. F. (2018) Mapping knowledge gaps in marine diversity reveals a latitudinal gradient of missing species richness. Nature communications, 9, 4713.

Mesibov, R. (2013) A specialist’s audit of aggregated occurrence records. ZooKeys, 18, 1–18.

Mesibov, R. (2018) An audit of some processing effects in aggregated occurrence records. ZooKeys, 751, 129.

Meyer, C., Kreft, H., Guralnick, R., & Jetz, W. (2015) Global priorities for an effective information basis of biodiversity distributions. Nature Communications, 6.

Meyer, C., Jetz, W., Guralnick, R. P., Fritz, S. A., & Kreft, H. (2016) Range geometry and socio‐economics dominate species‐level biases in occurrence information. Global Ecology and Biogeography, 25, 1181-1193.

Mitchell, P. J., Monk, J. & Laurenson, L. (2017) Sensitivity of fine‐scale species distribution models to locational uncertainty in occurrence data across multiple sample sizes. Methods in Ecology and Evolution, 8, 12-21.

Mora, C., Tittensor, D. P. & Myers, R. A. (2008) The completeness of taxonomic inventories for describing the global diversity and distribution of marine fishes. Proceedings of the Royal Society of London B: Biological Sciences, 275, 149-155.

Moudrý, V., Komárek, J. & Šímová, P. (2017) Which breeding bird categories should we use in models of species distribution? Ecological indicators, 74, 526-529.

Moudrý, V., Lecours, V., Gdulová, K., Gábor, L., Moudrá, L., Kropáček, J., & Wild, J. (2018) On the use of global DEMs in ecological modelling and the accuracy of new bare-earth DEMs. Ecological modelling, 383, 3-9.

Nelson, G., Sweeney, P., & Gilbert, E. (2018) Use of globally unique identifiers (GUIDs) to link herbarium specimen records to physical specimens. Applications in plant sciences, 6, e1027.

O'Hara, C. C., Afflerbach, J. C., Scarborough, C., Kaschner, K. & Halpern, B. S. (2017) Aligning marine species range data to better serve science and conservation. PloS one, 12, e0175739.

Osborne, P. E. & Leitão, P. J. (2009) Effects of species and habitat positional errors on the performance and interpretation of species distribution models. Diversity and Distributions, 15, 671-681.

Otegui, J., Ariño, A.H., Encinas, M. & Pando, F. (2013) Assessing the primary data hosted by the Spanish node of the Global Biodiversity Information Facility (GBIF). PloS one, 8, e55144.

Pender, J. E., Hipp, A. L., Hahn, M., Kartesz, J., Nishino, M. & Starr, J. R. (2019). How sensitive are climatic niche inferences to distribution data sampling? A comparison of Biota of North America Program (BONAP) and Global Biodiversity Information Facility (GBIF) datasets. Ecological Informatics, 100991.

Parravicini, V., Villéger, S., McClanahan, T.R., Arias-González, J.E., Bellwood, D.R., Belmaker, J., Chabanet, P., Floeter, S.R., Friedlander, A.M., Guilhaumon, F., Vigliola, L., Kulbicki, M. & Mouillot, D. (2014) Global mismatch between species richness and vulnerability of reef fish assemblages. Ecology letters, 17, 1101–10.

Peters, H., O’Leary, B.C., Hawkins, J.P. & Roberts, C.M. (2015) Identifying species at extinction risk using global models of anthropogenic impact. Global change biology, 21, 618–28.

Peterson, A. T. & Soberón, J. (2018). Essential biodiversity variables are not global. Biodiversity and Conservation, 27, 1277-1288.

Pompa, S., Ehrlich, P.R. & Ceballos, G. (2011) Global distribution and conservation of marine mammals. Proceedings of the National Academy of Sciences, 108, 13600–13605.

Powney, G. & Isaac, N. (2015) Beyond maps: a review of the applications of biological records. Biological Journal of the Linnean Society, 115, 532–542.

Robertson, D. R. (2008) Global biogeographical data bases on marine fishes: caveat emptor. Diversity and Distributions, 14(6), 891-892.

Robertson, M. P., Visser, V., & Hui, C. (2016) Biogeo: an R package for assessing and improving data quality of occurrence record datasets. Ecography, 39, 394-401.

Rocchini, D., Hortal, J., Lengyel, S., Lobo, J.M., Jimenez-Valverde, a., Ricotta, C., Bacaro, G. & Chiarucci, A. (2011) Accounting for uncertainty when mapping species distributions: The need for maps of ignorance. Progress in Physical Geography, 35, 211–226.

Rondinini, C., Wilson, K. a, Boitani, L., Grantham, H. & Possingham, H.P. (2006) Tradeoffs of different types of species occurrence data for use in systematic conservation planning. Ecology letters, 9, 1136–45.

Saeedi, H., Dennis, T. E. & Costello, M. J. (2017). Bimodal latitudinal species richness and high endemicity of razor clams (Mollusca). Journal of Biogeography, 44, 592-604.

Sikes, D. S., Copas, K., Hirsch, T., Longino, J. T. & Schigel, D. (2016) On natural history collections, digitized and not: a response to Ferro and Flick. ZooKeys, 618, 145.

Šímová, P., Moudrý, V., Komárek, J., Hrach, K., & Fortin, M. J. (2019) Fine scale waterbody data improve prediction of waterbird occurrence despite coarse species data. Ecography, 42, 511-520.

Troia, M. J. & McManamay, R. A. (2016) Filling in the GAPS: evaluating completeness and coverage of open‐access biodiversity databases in the United States. Ecology and evolution, 6, 4654-4669.

Turak, E., Regan, E. & Costello, M. J. (2017) Measuring and reporting biodiversity change. Biological Conservation, 213, 249-251.

Vandepitte, L., Bosch, S., Tyberghein, L., Waumans, F., Vanhoorne, B., Hernandez, F., De Clerck, O. & Mees, J. (2015) Fishing for data and sorting the catch: assessing the data quality, completeness and fitness for use of data in marine biogeographic databases. Database: the journal of biological databases and curation, 2015, 1–14.

Veiga, A. K., Saraiva, A. M., Chapman, A. D., Morris, P. J., Gendreau, C., Schigel, D., & Robertson, T. J. (2017). A conceptual framework for quality assessment and management of biodiversity data. PloS one, 12, e0178731.

Watcharamongkol, T., Christin, P. A. & Osborne, C. P. (2018) C4 photosynthesis evolved in warm climates but promoted migration to cooler ones. Ecology letters, 21, 376-383.

Webb, T.J., Vanden Berghe, E. & O’Dor, R. (2010) Biodiversity’s big wet secret: the global distribution of marine biological records reveals chronic under-exploration of the deep pelagic ocean. PloS one, 5, e10223.

Wessel, P., & Smith, W.H.F. (1996) A Global Self-consistent, Hierarchical, High-resolution Shoreline Database, Journal of Geophysical Research, 101, #B4, pp. 8741-8743.

Williams, R., Grand, J., Hooker, S.K., Buckland, S.T., Reeves, R.R., Rojas-Bracho, L., Sandilands, D. & Kaschner, K. (2014) Prioritizing global marine mammal habitats using density maps in place of range maps. Ecography, 37, 212–220.

Wood, J. S., Moretzsohn, F., & Gibeaut, J. (2015). Extending marine species distribution maps using non-traditional sources. Biodiversity data journal, (3).

Zizka, A., Silvestro, D., Andermann, T., Azevedo, J., Duarte Ritter, C., Edler, D., Farooq, H., Herdean, A., Ariza, M., Scharn, R., Svantesson, S., Wengström, N., Zizka, V. & Antonelli, A. (2019) CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases. Methods in Ecology and Evolution, 10, 744-751.

SUPPLEMENTARY MATERIALS

Additional supporting information may be found in the online version of this article:

Supplementary material S1 List of GBIF species including DOI

Supplementary material S2 List of OBIS datasets used

Supplementary material S3 R scripts and ArcGIS models used

Supplementary material S4 Number of records and records with specific issues for individual species

Supplementary material S5 Number of shared records between OBIS and GBIF for individual species

Positional accuracy varies greatly due to coordinate rounding

Coordinate precision is specified only in 45% and 70% of records in GBIF and OBIS

Less than 20% of the records were common between OBIS and GBIF

Known marine mammal diversity hotspots show high rates of potential false positives

Mechanisms to communicate quality issues between all groups involved are needed