Geospatial Big Data and archaeology: Prospects and problems too great to ignore


Journal of Archaeological Science xxx (2017) 1-21


Journal of Archaeological Science journal homepage: http://www.elsevier.com/locate/jas

Geospatial Big Data and archaeology: Prospects and problems too great to ignore*

Mark D. McCoy

Department of Anthropology, Southern Methodist University, P.O. Box 750336, Dallas, TX, 75275-0336, USA

Article info

Article history: Received 22 December 2016; Received in revised form 27 May 2017; Accepted 1 June 2017; Available online xxx

Keywords: Geospatial Big Data; Spatial technology; Cyberinfrastructure; Data science

Abstract

As spatial technology has evolved and become integrated into archaeology, we face a new set of challenges posed by the sheer size and complexity of data we use and produce. In this paper I discuss the prospects and problems of Geospatial Big Data (GBD), broadly defined as data sets with locational information that exceed the capacity of widely available hardware, software, and/or human resources. While the datasets we create today remain within available resources, we nonetheless face the same challenges as many other fields that use and create GBD, especially in apprehensions over data quality and privacy. After reviewing the kinds of archaeological geospatial data currently available I discuss the near future of GBD in writing culture histories, making decisions, and visualizing the past. I use a case study from New Zealand to argue for the value of taking a data quantity-in-use approach to GBD and requiring applications of GBD in archaeology be regularly accompanied by a Standalone Quality Report. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction Archaeology has long recognized that spatial location is a core variable in our field (Spaulding, 1960). Today, we create, use, and share geospatial archaeological data on an unprecedented scale. In a recent paper, Bevan outlined many of the challenges we face with “floods of new evidence about the past that are largely digital, frequently spatial, increasingly open and often remotely sensed” (Bevan, 2015:1473, emphasis added). As our locational datasets grow, and become more accessible, so does apprehension about data quality, privacy (especially the protection of the locations of archaeological sites), and how best to manage large and growing geospatial data. At the same time, we have amassed such large databases that, on some topics, it would be disingenuous to claim we do not yet have enough data (Bevan, 2015:1477). There is a growing literature in archaeology aimed at bringing attention to how we can best use technology (Kintigh, 2006; Snow et al., 2006) to achieve our larger disciplinary goals (e.g., Kintigh et al., 2014). The need for larger and more integrated geospatial data and analyses cross-cuts virtually all of our goals and aspirations as a science (Table 1). These require us to produce data and

* The special issue was handled by Meghan C.L. Howey and Marieka Brouwer Burg. E-mail address: [email protected].

results that are scientific (testable, replicable), authentic (a faithful representation of the archaeological record and the human past), and ethical (protects cultural resources). To that end, I am guided in this paper by three questions: 1) What kinds of geospatial data are available today? 2) How will larger and more accessible geospatial databases shape the near future of archaeology? And, using a case study from New Zealand, I examine the question, 3) What can we do now about apprehensions regarding data quality, privacy, and the growing size of archaeological geospatial datasets? These questions (what data are available, what the impacts of larger and more accessible data will be, and what we can do to mitigate our concerns about data) exemplify current debates about Big Data in general, and Geospatial Big Data specifically. Geospatial Big Data (GBD) can be broadly defined as data sets that include locational information and exceed the capacity of widely available hardware, software, and/or human resources. Before we go further, it is important to note that as of today, nearly all archaeological datasets fall short of being defined as GBD, since the volume of data we work with rarely outstrips the capacity of available resources; the exception is remotely sensed data (satellite imagery, lidar). But, while the volume of archaeological geospatial datasets is currently manageable, there are at least two good reasons we should begin to think about our geospatial datasets as GBD. First, due to the fragmentary nature of archaeological material evidence we are compelled to work with a broad variety of sources of data, to code complex contextual information into a

http://dx.doi.org/10.1016/j.jas.2017.06.003 0305-4403/© 2017 Elsevier Ltd. All rights reserved.

Please cite this article in press as: McCoy, M.D., Geospatial Big Data and archaeology: Prospects and problems too great to ignore, Journal of Archaeological Science (2017), http://dx.doi.org/10.1016/j.jas.2017.06.003


Table 1. Grand Challenges for Archaeology and the Need for Larger Geospatial Data and Analyses. Kintigh et al.'s (2014:5) summary of the “most important scientific challenges” for archaeology highlights a number of areas where the need for larger geospatial datasets and analyses is paramount. The purpose of this paper is to identify how we are currently building, using, and sharing geospatial data; what larger geospatial datasets will mean for the near future; and to suggest ways we may overcome apprehension over the adequacy of large geospatial datasets, otherwise known as Geospatial Big Data, that will be necessary to meet our disciplinary goals.

Columns: General Topic; Examples of the Need for Larger Geospatial Datasets and Analyses.

General Topic: Emergence, Communities, and Complexity
Examples:
“Archaeological data on cities range from small architectural details and short-lived cities to broad patterns of heterogeneous urban textures covering many square kilometers and presenting a historical depth of millennia. Consequently, characterizing long-term urban fabrics and animating associated behaviors via computational modeling requires enormous data archives and substantial computational infrastructure” (Kintigh et al., 2014:10, emphasis added).
“Conflict is notoriously difficult to identify and quantify through archaeological remains … more systematic and large-scale analyses are certainly necessary.” (Kintigh et al., 2014:10, emphasis added).
“Inequality can be systematically inferred through studies of landscape, monuments, residences, and mortuary remains … Quantitative dynamic modeling to emplace general models of sociopolitical change in specific prehistoric and historical settings … will be critical to our success.” (Kintigh et al., 2014:9, emphasis added).

General Topic: Resilience, Persistence, Transformation, and Collapse
Examples:
“The archaeological record is replete with examples of the rise and fall of communities of all scales … With recent advances in the quantity and quality of archaeological and historical studies, we can uncover robust patterns in societal collapses over time and space.” (Kintigh et al., 2014:11, emphasis added).

General Topic: Movement, Mobility, and Migration
Examples:
“Typically, archaeologists have explored human mobility through a case-study approach based on archaeological and ancillary data from small-scale research projects. However, we also see the need for regional- and continental-scale studies that match the scale of the problem to the scale of particular interactions.” (Kintigh et al., 2014:13, emphasis added).

General Topic: Cognition, Behavior, and Identity
Examples:
“… how did humanity arise? ... a massive body of emerging data are critical to resolving this question…” (Kintigh et al., 2014:15, emphasis added).
“Tracking and evaluating localized arrangements and reconfigurations … necessitates extensive investments in digital spatial datasets that incorporate LiDAR, geophysical, and other three-dimensional data that allow virtual exploration and analysis.” (Kintigh et al., 2014:15, emphasis added).

General Topic: Human Environment Interaction
Examples:
“How do humans perceive and react to changes in climate and the natural environment over short- and long-terms? ... The challenge is to move from case or regional studies to larger scale comparative research, and to learn how to make generalizable statements about how people make choices that draw on universal biases in cognition … [this] will require making data from relatively small field projects widely accessible and increasing current technological capabilities to allow for studies of human-environment interaction to increase in scope and complexity” (Kintigh et al., 2014:18-19, emphasis added).

digital format, and to interpolate trends across time and space using sparse data. These types of problems (variety, veracity, visualization) mirror issues raised by Big Data (see also Huggett, 2016). Second, from the perspective of data science our data are probably best classified as ‘embryonic’ Geospatial Big Data in that they are likely to grow extremely large in volume in the future. We have the opportunity now to shape our growing geospatial datasets before it becomes necessary to come up with specialized solutions for common tasks. It is also important to note that the problem of best practices regarding geospatial data is well-known to the subfield of geospatial archaeology, as well as archaeology that engages with computer and data science. As the science and technology dealing with GBD evolves, the hyper-technical side of archaeology is more important than ever. But, since GBD is already influencing how we write culture history, visualize our research, and participate in public discourse about science and heritage, I feel it is timely to review and comment on this topic for a broad audience in terms as non-technical as is reasonable.

2. Geospatial big data and archaeology

Today, we refer to any information “of or relating to the relative position of things on the earth's surface” as geospatial data (Collins English Dictionary). Geospatial Big Data (GBD) is geospatial data that exceeds the capacity of widely available resources (i.e., hardware, software, human resources) and requires specialized effort to work with. Applied research in GBD tends to be driven by the perceived economic benefit of mining data to reveal spatial relationships that make businesses more cost efficient, enhance insight into customers' behavior, and help industry make better


decisions. For example, when Netflix suggests movies and TV shows to watch, it does so in part based on what is popular near you. That is GBD at work. It is ‘Big’ in that Netflix is using millions of previous searches and it is ‘Geospatial’ by tagging search data by zip code and then using that variable in its recommendation algorithm (Lee and Kang, 2015:76). For data science, GBD is a subset of the more general effort of dealing with Big Data, and since “much of the data in the world can be geo-referenced,” it is hard to overestimate the “importance of geospatial big data handling” (Li et al., 2016:120).

There is no specific test that would identify and classify data as GBD; rather, GBD is commonly said to have one or more of several characteristics: volume, velocity, and variety (Laney, 2001). Additional characteristics have since emerged, including veracity, visualization, and visibility (Li et al., 2016; see also Suthaharan, 2014). Archaeology is not what data scientists had in mind when outlining what constitutes GBD. In my view, we nonetheless should begin to think of our geospatial data as GBD while they are still manageable in terms of volume, in part, so we can improve how we deal with the other issues inherent in creating, maintaining, and using GBD. Below are some examples of how the GBD characteristics apply to archaeology:

Volume for web-based businesses is measured not in gigabytes, or terabytes, but petabytes (1000 gigabytes = 1 terabyte; 1000 terabytes = 1 petabyte). For archaeology, it is impossible to say precisely how much data there are, but we know there are two major sources of geospatial data that together represent high and growing volume. First, there are data coming from legacy projects, especially as we migrate the white paper backlog into digital form (reports, forms, catalogs, field notes, photographs, etc.). This is an unknown but probably substantial volume of information that will grow even though the underlying research may have been finished decades ago. Second, the often cited statistic that 90% of the world's data was generated in the last two years applies to archaeology too. Therefore, much more daunting in terms of volume is the size of satellite remote sensing (Wiseman and El-Baz, 2007), field data (GPS, photogrammetry, drones, laser scanning, etc.), and computer based research, such as simulation.

Velocity is often a problem for GBD applications because of the torrent of information coming in around the clock. For archaeology, high and increasing velocity is a concern, but the inconsistency in the velocity of data is equally problematic. Take for example one of the most visible archaeological datasets on the web, the Digital Archaeological Record (tDAR). In 2011, tDAR integrated a large (+350,000) database of reports and citations created by the US National Parks (National Archaeological Database, NADB). This was a major positive step forward for the digital archive, but in terms of velocity, it means in one year it grew six times larger than all other years combined (2008-16).

Variety in sources, types, and precision of data can create intractable problems for any database. For archaeology, Cooper and Green (2015) recently summarized how in the English Landscapes and Identities project they dealt with information coming from a wide range of sources collected over generations. Even after careful research, there would sometimes be no clear way to tell, for example, if sources were talking about the same monument five times, or five different monuments in the same place. Precision in terms of geolocation is much easier to achieve today. Nonetheless, the number of ways we might record a site (i.e., map data, imagery) and index it (i.e., site name, place name, site type), means variety will continue to be a challenge.
Veracity is more than locational accuracy; for archaeology it is a question of the quality of the information within a narrowly defined set of relational variables, something we commonly refer to as context. In data science, the size of Big Data is sometimes used to justify the use of unverified sources of information, the underlying logic being that the sheer volume of data will overcome the inclusion of some datasets with poor accuracy. We are beginning to see this approach applied to archaeological geospatial data, and not surprisingly, this has raised concerns for how we account for context.

Visualization of data is employed at all stages of research (i.e., generating hypotheses, identifying patterns, representing results) to help us make sense of abstract information. Archaeology has developed by consensus a number of methods for visualizing our geospatial data in static products (i.e., regional maps, site plans, stratigraphic drawings, etc.). Today, with the advent of 3D technologies, it is much easier to also represent the forms of artifacts and sites in an interactive digital format, but these have yet to supplant static products.

Visibility of archaeological geospatial datasets is at an all-time high in terms of coverage, variety, and richness. Advances in cloud technology and web GIS mean we are seeing a growth in online data repositories, as well as site location indexes, atlases, and gazetteers. Increased visibility naturally comes with increased concerns with privacy and misuse of archaeological data, and perhaps counter-intuitively, illustrates the gaps where geospatial datasets are not visible.

These examples are certainly not an exhaustive list of the ways in which archaeological geospatial data has the qualities of GBD; nonetheless, they illustrate why in this paper I have chosen to classify our largest geospatial datasets as GBD.

3. How we use Geospatial Big Data in the present

The kinds of geospatial data that are available to professional archaeologists and the public today vary wildly depending upon region, the time period, the topic of interest, and the type of evidence of the human past.
For the purposes of this discussion I have classified geospatial datasets (Table 2) into a number of different types: data repositories, location indexes, radiocarbon databases, project websites, and academic sources. This is not an exhaustive list, nor are these exclusive categories; they are instead meant to represent a cross-section of how we create, share, and use GBD in practice today. I have further broken down these categories by a qualitative summary of the sources of geospatial data used, accessibility, and quality, as a way to evaluate sources in terms of their potential for ‘data mining.’ Data mining itself is a misleading term in that the goal is not to extract a specific piece of existing data, but to discover new patterns and/or associations that would otherwise be impossible to recognize and evaluate; a process also referred to by the somewhat ambiguous, but appropriate, term from data science: ‘knowledge discovery.’ I have not attempted to review geospatial databases as they apply to museums and artifact collections (Dörter and Davis, 2013), although I recognize that these have several added complexities in terms of data quality and the need to code locational information on provenience (where an object was reportedly found), and provenance (where it has been since it was found, and where it is located today). One of the first-order differences between contemporary geospatial databases is the distinction between archival databases versus integrative databases. Archival datasets grow by accretion of distinct datasets, whereas in integrative datasets new data are added into a single database as they become available. For example, data repositories like Archaeology Data Service (ADS) and The Digital Archaeological Record (tDAR) are omnibuses that take in single geospatial datasets that keep their distinct character and are discoverable along with any number of other types of non-geospatial data.
In contrast, site indexes like the Digital Index of North American Archaeology (DINAA), digital gazetteers like Pleiades (pleiades.stoa.org), or radiocarbon databases on archaeological materials like the Canadian Archaeological Radiocarbon Database (CARD), continuously take in new information from a variety of sources and compile it into a single database. Each of these two kinds of datasets draws on academic and applied cultural resource management sources for information.

The integrative databases, as opposed to the archival databases, are especially good at exposing clear biases in what kinds of information are currently accessible. For example, Fig. 1 shows several kinds of geospatial data: Mesoamerican ‘sites’ (The Electronic Atlas of Ancient Maya Sites), ‘ancient places’ of the Mediterranean (Pleiades), and sample locations derived from several major radiocarbon databases and regional studies. Each uses coordinates to record location and so is easily converted into the same data model (vector, points). The site database is a “pan-Maya registry of ancient Maya settlements” and each point has a corresponding assessment of “Site Rank” to allow for quantitative geospatial analyses. In contrast, ‘ancient places’ includes settlements and an extremely broad variety of other categories, such as place names, to aid in the qualitative analyses of historical texts. The third map shown, radiocarbon data from archaeological sites, is different again. Like the site database, radiocarbon databases are clearly built with quantitative analysis in mind, but like the ancient places database, they include any and all kinds of phenomena (i.e., evidence of settlements, foraging, farming, burials, etc.). It should also be noted that it is possible that some radiocarbon dates within a database reflect natural processes rather than human behavior. Nonetheless, even after combining some of the largest archaeological databases available, the geographic bias toward research on North America and Europe is readily apparent. There may also be some geographic biases in the other two examples, but the regional-temporal focuses would appear to achieve a level of evenness not seen in other kinds of datasets.

An even more difficult quality to evaluate in our largest geospatial datasets is geospatial-temporal coverage. Take for example recent studies from the Near East (Lawrence et al., 2016) and China (Hosner et al., 2016). The goals of each study are similar (to qualify and quantify paleodemographic and settlement pattern trends over the Neolithic through Bronze Ages), and both take a time-slice approach where site records are coded by cultural period (e.g., Bronze Age) and by absolute time (century or millennium scale) with a beginning/end (min/max age) and time period (e.g., 6 kya). The Near Eastern study is focused on urbanism and, using a mix of survey and remote sensing, each record includes an estimate of site size. Both studies compare their data to shifting climate regimes in their respective regions and other key variables. These are both ambitious and important undertakings using “the best available and most up-to-date coverage of archaeological sites obtained by salvage and research excavations and surveys” and recognize the underlying data has known limitations in terms of uneven geospatial-temporal coverage within the respective regions (Hosner et al., 2016: 1589, emphasis added). At first glance, the Chinese dataset appears to have much better coverage with +50,000 site records compared with fewer than 400 sites in the Near East. However, if we account for the total size of the regions

Fig. 1. Geospatial big data on sites, ancient places, and radiocarbon dates. Sources: The Electronic Atlas of Ancient Maya Sites: A Geographic Information System (GIS); Pleiades, the Stoa Consortium; Goldberg et al., 2016; Martindale et al., 2016; Russell et al., 2014; Silva et al., 2015; Vermeersch, 2016.


Table 2. Examples of Geospatial Big Data in Archaeology. See full internet references in List of Web and Data Sources.

Columns: Type of Source; Examples; Sources of Geospatial Information; Data Accessibility; Data Quality.

Type of Source: Data Repositories
Examples: Archaeology Data Service (ADS); The Digital Archaeological Record (tDAR)
Sources of Geospatial Information: Stand-alone static datasets (GIS layers, locations as fields in other datasets) from academic, government, and cultural resource management.
Data Accessibility: Public and private web portals. User registration required for download.
Data Quality: Approved digital repositories for the results of publicly funded science. Supports training in best practices.

Type of Source: Location Indexes
Examples: Site Indexes. Digital Index of North American Archaeology (DINAA); ArchSite (New Zealand Archaeological Association). Atlases and Gazetteers. CORONA Atlas of the Middle East, Antiquity A-la-carte and Pleiades, American Institute of Archaeology's Archaeology of North America
Sources of Geospatial Information: Single databases from a union of site information from academic, government, and cultural resource management.
Data Accessibility: Web-GIS, public and private web portals. Most require user registration for download/access.
Data Quality: Mix of approved digital repositories for the results of publicly funded science and geodata built for public consumption. Supports training in best practices.

Type of Source: Radiocarbon Databases
Examples: Canadian Archaeological Radiocarbon Database (CARD 2.0); Radiocarbon Palaeolithic Europe Database (Version 20)
Sources of Geospatial Information: Single databases from a union of independently reported radiocarbon results with locational information.
Data Accessibility: Public and private web portals. User registration required for download.
Data Quality: Long-term projects with updates to correct errors.

Type of Source: Project Websites
Examples: Preservation. CyArk (non-profit, 3D and VR), Sketchfab (for profit, community portal supporting 3D and VR). Digital Archiving. Comparative Archaeology Database (University of Pittsburgh). Long-term Projects. Paleoindian Database of the Americas (PIDBA), English Landscapes and Identities Project
Sources of Geospatial Information: Variable. Datasets reflect the goals and focus of study but can be broken into broad categories (atlases, preservation, archiving, etc.).
Data Accessibility: Public web portals. No user registration required. Some allow download.
Data Quality: Variable. Some long-term projects are a union of datasets, others are stand-alone databases from completed projects.

Type of Source: Academic Sources
Examples: Journals. Journal of Archaeological Science, Journal of Archaeological Science: Reports, Archaeometry, Archaeological Prospection. Academic Libraries and Centers. Stanford Geospatial Center, Harvard Geospatial Library, Ancient World Mapping Center (Brown), Center for Advanced Spatial Technologies (Arkansas)
Sources of Geospatial Information: Some archaeological journals allow optional supporting datasets with geospatial information. University support centers and libraries provide historic and environmental datasets useful for archaeology.
Data Accessibility: Public web portals. Journal articles with supplemental data may be behind the ‘pay wall’. User registration required for some downloads.
Data Quality: Datasets published as supplemental material have undergone peer-review. Self-archived datasets may or may not have been reviewed.
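The comparison of the Chinese and Near Eastern site datasets made in the running text around this table rests on normalizing record counts by study area. A minimal sketch of that arithmetic follows, using only the figures quoted in the text; the function name `records_per_million_km2` is hypothetical, and the totals 50,000 and 400 are the rounded bounds given in the text ("+50,000" and "less than 400"), so the first ratio is approximate.

```python
def records_per_million_km2(n_records: int, area_million_km2: float) -> float:
    """Return site records per million square km (hypothetical helper)."""
    if area_million_km2 <= 0:
        raise ValueError("area must be positive")
    return n_records / area_million_km2


# Study areas quoted in the text, in millions of square km.
CHINA_AREA = 9.6
NEAR_EAST_AREA = 0.3

# All records: rounded bounds quoted in the text, so this is approximate.
china_all = records_per_million_km2(50_000, CHINA_AREA)
near_east_all = records_per_million_km2(400, NEAR_EAST_AREA)
ratio_all = china_all / near_east_all  # about 3.9, i.e. "about four times"

# Records dated between 3 and 4 kya, as quoted in the text.
china_slice = records_per_million_km2(19_837, CHINA_AREA)
near_east_slice = records_per_million_km2(222, NEAR_EAST_AREA)
ratio_slice = china_slice / near_east_slice  # about 2.8, "less than three times"

print(f"all periods: {ratio_all:.2f}x; 3-4 kya: {ratio_slice:.2f}x")
```

Dividing by area before comparing is what turns a raw difference of more than a hundredfold in record counts into a density difference of roughly three to four times.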

examined (~0.3 million vs ~9.6 million square km), the Chinese dataset has only about four times as many records per million square km as the Near Eastern dataset, and if one considers coverage within contemporaneous time periods, the density of sites is even more similar. Between 3 and 4 kya, there are less than three times as many sites per million square km in the Chinese data than the Near Eastern data (19,837 sites/9.6 mil sq km vs. 222 sites/0.3 mil sq km). The lesson here is clear: these datasets are more alike than one would estimate just based on the number of records when one accounts for space and time.

The quality of geospatial data goes beyond simply reporting the provenience of artifacts, or the location of sites, and is probably best thought of in terms of how well the dataset conforms to established best practices. Most archaeological geospatial datasets rely on users to self-police when it comes to quality and report critical information that others might use to evaluate quality as metadata. Here again the distinction between archived and integrative is a critical one; archived data are frozen in place, whereas integrative data can be revised with updated versions. This does not mean one is better than the other, or that high quality studies will necessarily have matching high quality geospatial datasets. For example, we would expect archived data published as ‘supplementary material’ in peer-reviewed journals to conform closely to best practices. Unfortunately, the geospatial data underlying many, if not most, studies are simply missing. Major academic journals, including the Journal of Archaeological Science, do not require publishing geospatial data alongside new research. This is not a problem that is unique to geospatial data, but one that could be fixed (see Horsburgh et al., 2016 for a similar critique regarding the lack of rigorous publication of zooarchaeological data).

Publishing the locations of archaeological sites raises a number of serious privacy issues. In a rare exception to the Freedom of Information Act in the United States, federal archaeologists routinely withhold “the location of archaeological sites that are not formally open for public visitation … to protect the sites from looting and vandalism” (Hitchcock, 2006:471). Large databases of site records have also been used to fight looting, as has been the case in satellite imagery monitoring of the impacts of looting on cultural heritage in the Near East (Contreras and Brodie, 2010; Stone, 2008, 2015). For example, in Syria, Casana (2015:150) has discovered through examining the impacts of looting over time “… that war-related looting is most frequent and most widespread in Kurdish and opposition-held areas, which are, perhaps unsurprisingly, also the regions with the weakest centralized authority.”

4. Use of Geospatial Big Data in the near future

Kintigh et al. (2014) clearly outline why we should aspire to larger and more accessible geospatial databases, but in my review of how we currently use GBD, I found several trends that I believe are strong indicators as to how this technology will be applied in the near future. To be clear, these are not necessarily the most important or innovative uses of GBD, and I do not predict that these will change the fundamentals of how we define our goals. I do however take the position that the changes I highlight here will apply broadly, and not just to a sub-set of tech savvy experts. Having said that, as we approach the volume of Big Data, we may find it more and more necessary to use machine learning to classify data (e.g., Maaten et al., 2007), engage in far more open data archaeology (Huggett, 2014, 2015a, 2015b), and rethink how we create datasets with downstream Big Data analysis and simulation in mind (Barton, 2013; Kansa et al., 2014; Kintigh, 2015).

4.1. Culture history in the era of Geospatial Big Data

Archaeology has had a long love affair with GIS. There were concerns early on that GIS would lead us down a path to environmental determinism (Gaffney and van Leusen, 1995). The evidence to the contrary is all around us in the variety of uses we have put GIS to (McCoy and Ladefoged, 2009). I am not overly concerned about newer additions to our spatial technology tool kit pulling us down one or another theoretical path. Geospatial Big Data, in my view, will not lead us down an unintended path given that GBD is something that has attracted our colleagues in both the Earth Sciences (e.g., Karmas et al., 2016) and in Digital Humanities (e.g., Bodenhamer et al., 2010). This is not to say that it will have no effect on how we do archaeology, and here we turn to the topic of how we write culture histories in the era of GBD. The best example of how GBD can, and will, influence how we write culture histories of the prehistoric past is in the use of radiocarbon mega-databases.
From the beginning of the use of radiocarbon dating in archaeology, we have seen value in the collection of regional radiocarbon databases. As early as the 1960s, Green (1964) made the case for a standardized paper index card system for radiocarbon dates in Oceania (see also Jelinek, 1962). Today, radiocarbon databases continue to be regionally organized, and they are most often, but not exclusively, applied to one of two global phenomena: (1) migrations (Silva and Steele, 2014) and demography (Steele, 2010) of modern humans in the Pleistocene in Europe (Vermeersch, 2016), Australasia (Field et al., 2007; Williams et al., 2014), and the Americas (Chaput et al., 2015; Delgado et al., 2015; Goldberg et al., 2016; Peros et al., 2010); and (2) the spread of Neolithic farmers, or domesticates, in Europe (Crombé and Robinson, 2014), Asia (Silva et al., 2015), the Americas (Lemmen, 2012), Africa (Russell et al., 2014), and Polynesia (Mulrooney, 2013; Wilmshurst et al., 2011). The geographic distribution of the databases that underlie this research by necessity stretches beyond national boundaries, and the databases vary in total size from a few hundred to tens of thousands of records. For the most part, they use the coordinates (latitude, longitude; northing, easting) of the site where a radiocarbon date was reported. There is always a small fraction of dates without the requisite site location coordinate information. These databases often have extensive information on the context, material dated, and laboratory results. Concerns regarding studies using mega-radiocarbon databases naturally vary from case to case, and at this stage probably warrant their own lengthy review, but in brief they tend to center around a few related key points. First, there is the question of what underlying phenomenon is being measured.
These databases are the result of many different studies and it is not always clear what thresholds have been used to separate natural from cultural phenomena, and proxy measures for the presence-absence or intensity of activity in a location are study dependent (Attenbrow and Hiscock, 2015). Second, there is the question of sampling. There will always be locations, and time periods, that will be over-sampled or under-sampled, due to the natural process of taphonomy and the spatially discontinuous nature of archaeological research. Even the largest mega-radiocarbon database studies recognize that these must be dealt with (Chaput et al., 2015:12131). Lastly, and related to the question of sampling, is the transformation of data. Radiocarbon dates are statements of probability, and these probabilities are often transformed, as pooled or summed probabilities (e.g., Bamforth and Grund, 2012; Contreras and Meadows, 2014), or as Bayesian models (e.g., Long and Taylor, 2015), to identify trends over time. When it comes to transforming data over space, the interpolation of data points is a well-trodden path for geospatial analyses. Results are often presented as time-slice maps, and short videos, as seen in recent studies of demography in Ireland (vector, point; McLaughlin et al., 2016), and North America (raster, heat-map; Chaput et al., 2015).

Historical archaeology in North America, where radiocarbon dating is more rarely used, presents an interesting counter to the role of GBD in archaeology. For example, the Digital Archaeological Archive of Comparative Slavery (DAACS) is an extraordinarily data rich “Web-based initiative designed to foster inter-site, comparative archaeological research on slavery” including +2 million artifacts and chronological information (mean ceramic date; South, 1977) that can be derived a number of ways depending on the query. Like radiocarbon databases, the DAACS is designed to be regional (geographic coverage includes the US Southeast and the Caribbean). The quantity of sites is many times lower than in regional radiocarbon databases (DAACS includes 73 sites), but the quality of information is outstanding, including site plans, Harris matrixes, and a range of other types of information.
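As a rough illustration of the summed probability transformation mentioned above, the sketch below sums one density per date over a grid of calendar years. It rests on a strong simplifying assumption that does not come from the studies cited: each calibrated date is treated as a normal distribution with a mean and standard deviation, whereas real calibrated distributions are irregular and often multimodal. The function names and the example dates are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def summed_probability(dates, years):
    """Sum one density per date at every calendar year in `years`.

    `dates` is a list of (mean_year, sd) pairs. Treating each calibrated
    date as normal is an assumption made only for illustration.
    """
    return [sum(normal_pdf(y, m, s) for m, s in dates) for y in years]


# Hypothetical dates (cal CE): two overlapping dates and one later date.
dates = [(1500, 30), (1520, 30), (1700, 40)]
years = list(range(1200, 1901))
spd = summed_probability(dates, years)
peak_year = years[spd.index(max(spd))]
```

Because each date contributes a total probability of one, the curve sums to roughly the number of dates. A heavily dated site or period therefore raises the curve whether or not activity was greater, which is the sampling concern raised above.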
One factor that should concern archaeology, no matter what time period or region, is balancing new opportunities for writing culture histories based on large geospatial datasets against the unintended thoughtlessness toward context that such studies could promote. Specifically, the kind of thoughtlessness that concerns me comes from either leaving out important data because it is difficult to include in the dataset or over-including inappropriate data simply because it is easy and available. So, while on the one hand we cannot let our GBD become ‘hoppers’ filled with data that is handy, it would be a waste if we fail to gain the benefit that data mining large databases would allow. We have to come to terms with the fact that some results will migrate to geospatial datasets well and others will be much more difficult. With existing databases there are a number of options, including treating large databases as if they were complete to expose biases (e.g., Cooper and Green, 2015), using computational models (e.g., Barton et al., 2010) to overcome sampling problems, and using data on modern population and land cover as proxy measures for systemic bias in recovery (e.g., Miller, 2016). It is especially important to identify and account for gaps due to recovery bias since large geospatial radiocarbon databases can reveal periods of demographic collapse that should be of keen interest to archaeology (Shennan et al., 2013; McLaughlin et al., 2016; Mulrooney, 2013; Zubimendi et al., 2015).

4.2. Archipelagos of geospatial data and decision making

One of the times when geospatial data is most critical is when making decisions regarding site preservation, a factor weighed carefully in academic research and cultural resource management. Take for example the concerns over the current and future impacts of the Dakota Access Pipeline (DAPL) to the natural environment and cultural sites.
The DAPL project is a 1886 km (1172 miles) long pipeline that is designed to bring oil produced in North Dakota to production facilities in Illinois. At the time of writing, intense protests continue in the tribal territory of the Standing Rock Sioux in North Dakota centering mainly on water quality but also the effect the project has had on sacred sites. Colwell (2016) rhetorically asks, “What sacred sites have been damaged by [the pipeline]? We can't really know for certain - and our legal system [in the US] is partly to blame.” The Society for American Archaeology has written to the US Army Corps of Engineers to ask for a review of the project citing a number of problematic signs that the approach taken to the tribal consultation process was too piecemeal and the area under consideration may have been too restrictive (Gifford-Gonzalez, 2016). It is unclear how the DAPL situation will ultimately be resolved but it does throw into high relief some of the consequences of ‘data silos.’ Here I am using the term silo to refer to the degree of access the public has to geospatial data. For example, the company that is building the pipeline, Energy Transfer Partners, presents the proposed route of the pipeline online in five static maps (daplpipelinefacts.com): one that highlights the 50 counties it crosses, and four state-scaled maps of North Dakota, South Dakota, Iowa, and Illinois. The actual route is ‘silo-ed’ by the fact that only static, low quality maps are made available. There have been a number of efforts to counter this by mapping the route reportedly based on public records and crowdsourced information (Nitin Gadia, bakkenpipelinemap.com). Maps have also been used as a form of protest of the DAPL, including one juxtaposing the proposed route and “unceded Sioux land under 1851 treaty” (northlandia.wordpress.com), and a map showing the location of protests with Lakota/Dakota place names (map by Jordan Engle and Dakota Wind) as part of The Decolonial Atlas (decolonialatlas.wordpress.com).
To varying degrees these reflect a broader trend of more, largely untrained, private citizens participating in Volunteered Geographic Information (Goodchild, 2007), exemplified by OpenStreetMap.org, a platform that has been used in humanitarian ‘crisis mapping.’ The natural question is where along the proposed route are known archaeological sites, and here we see other examples of data silos. The environmental assessment report prepared by Energy Transfer Partners for the US Army Corps of Engineers for the Illinois segment of the proposed route in part reads (Dakota Access, 2016:66):

“A check of previously-recorded cultural resources was undertaken within a 1.6-km (km) (1.0-mile) radius of the Proposed Action Areas/Connected Action Areas prior to the commencement of fieldwork. Online databases were consulted, including the National Historic Landmark list and the National Register of Historic Places. The Historic and Architectural Resources Geographic Information System (HARGIS), maintained by the Illinois Historic Preservation Agency (IHPA), was consulted for locational and other information regarding historic buildings, historic engineering structures, and cemeteries. The Illinois Inventory of Archaeological Sites geodatabase, maintained by the Illinois State Museum, was consulted for locational and other data regarding recorded archaeological sites and previously-reported archaeological surveys and excavations. The Illinois Cultural Resource Management Report Database, maintained by the University of Illinois, was consulted for detailed information available in previous reports. General Land Office maps were researched at the Federal Township Plats website maintained by the Illinois Secretary of State. Old county plat maps and atlases were researched at the Illinois State Library and the Galesburg Public Library.”

This type of convoluted site record searching is typical of the due diligence required in cultural resource management, and since each of the four states on the proposed route has its own set of relevant state agencies, universities, and libraries, querying more than a dozen sources of geospatial data is required to identify previously recorded sites. The administrative and institutional silos highlighted above of course do not apply to national scale databases that cross US state administrative boundaries; these larger datasets belong to a silo defined by cultural value. Specifically, the National Register, and to a lesser degree the National Historic Landmark designation, include places that the US government considers of national importance. The National Register lists +90,000 locations, and is accessible as a web-based GIS point layer online (nps.gov). In the Standing Rock Sioux's territory, and across the US, the National Register is mainly made up of historic buildings. Therefore, in rural areas the density of sites is low; indeed, there are only 441 in all of North Dakota (as of July 2015). Most relevant to the DAPL project is the question of Traditional Cultural Properties (TCP). Within the National Register a TCP “is a property … eligible for inclusion … based on its associations with cultural practices, traditions, beliefs, lifeways, arts, crafts, or social institutions of a living community” (nps.gov). This is a category defined by the value of a place to local groups, and while it is common for physical evidence of the past (e.g., an archaeological site) to also be a cultural site, sacred places are not necessarily marked by physical evidence of past activities, nor are their locations necessarily something that is appropriate to be shared broadly. Indigenous geographers have made in-roads in thinking through how to use advances in spatial technology (Dobbs and Louis, 2015), but consultation and collaboration remain the best way to identify TCP.
Crowdsourcing is one avenue to break down silos, and the use of crowdsourcing to fund archaeology and create geospatial datasets has attracted a great deal of attention. Bonacchi et al. (2015) describe lessons learned from the MicroPasts website (crowdsourced.micropasts.org) - a site used to try to attract crowdfunding and which also served as a portal to access the results of crowdsourced data - including that crowdfunding was most effective when used as a catalyst for more mixed models along with offline donations. Parcak's new GlobalXplorer website (globalxplorer.org) is aimed at attracting crowdfunding as well as spreading the analysis of satellite imagery through “creating a global network of citizen explorers”. But, without a way to access the data that is created through this volunteer science, the results may prove to be another data silo.

While I have emphasized the roles of silos, I would note that the notion of creating a top-down, single geospatial cyberinfrastructure (CI) is probably doomed to failure. Snow et al. (2006) highlighted the fundamental problem of our inability to simultaneously access different categories of information (databases, grey literature, and images) and pointed out that CI should be allowed “to evolve as it is adopted, used, and contributed to by a community … to do so also involves solving problems of confidentiality and trust, and securing long-term commitment from agencies” (Snow et al., 2006:959). Along those lines, we are beginning to see the organic conglomeration of independently created geospatial databases into archival or integrative databases. For example, the Canadian-based CARD database (Martindale et al., 2016) now includes a massive database on Paleolithic Europe (Vermeersch, 2016), and the Australia-based Field Acquired Information Management System (FAIMS) project moved their repository due to a lack of funds and resources to the US-based Digital Archaeological Record (tDAR).
However, to be clear, I do not believe this portends a single CI for geospatial data in archaeology on the horizon. One thing that I would like to see, and I think we will see, is more visibility between geospatial datasets by creating more geoportal platforms to connect archipelagos of related datasets. Well-tended silos are how Geospatial Big Data are going to grow and be quality checked. Without them we lose institutional knowledge of the ‘character’ of databases and the ability to update them. But at the same time we need to build digital architecture that allows us to see that other data silos exist, even if access is limited. The technology for creating archipelagos of geospatial datasets exists and is improving - the European Union's Inspire Geoportal (inspire-geoportal.ec.europa.eu) is a good example - and the marketplace for cloud computing web GIS (e.g., CartoDB) will likely give archaeology many more options than currently available as desktop software (e.g., ESRI's ArcGIS; Quantum GIS) or web mapping (e.g., Google Maps, Google Earth).

4.3. Visualization and ways to see the past

It is difficult to overestimate the power of good visuals in communicating the results of archaeological research and connecting with the public. The high-resolution satellite imagery showing major sites before-and-after looting, for example, is a powerful way to illustrate ongoing threats to cultural heritage. It also goes without saying that the proliferation of UAV (unmanned aerial vehicles), 3D terrestrial laser scanning, and other spatial technologies is multiplying the volume of geospatial data available to archaeology at a breakneck pace. It is interesting to see how archaeology deals with the inherent difficulties in visualization of GBD. Data visualization when it comes to GBD pushes against the natural limitations of what the human eye can visually process; this is one reason why we see more use of geographic scale-dependent representations (see for example, DINAA). In archaeology, the topic of social network analysis (SNA) is a domain where we are seeing lots of visualization of large geospatial databases. SNA comes with its own data model (nodes, links) and diagrams that illustrate networks.
The connection back to geographic space, and geographic relationships, can be embedded within the visualization (e.g., Clark et al., 2014), or the results of the SNA can be mapped on to the real world in some fashion (e.g., Mills et al., 2013). Representative visualization - that is, trying to get across to your audience how archaeology looks today, or looked at some point in the past - is more accessible as 3D models become easier to create and manipulate through technologies like structure-from-motion 3D models (e.g., AutoDesk's 123D Catch), and more user-friendly computer aided drawing software (e.g., Trimble's SketchUp). The results of professional surveying are also reaching a larger audience through work by outfits like CyArk and community sharing web platforms like Sketchfab. 3D models are being integrated within web GIS, through Google Earth's Street View that allows viewers to ‘visit’ an archaeological site, and as Virtual Reality (VR) becomes more commonplace, these virtual visits will certainly become more immersive. There are a number of great examples of the use of GBD and social media, such as a recent web GIS (CartoDB) visualization of geotagged posts from around the world as part of the Day of Archaeology Project (jessogden.carto.com). The potential for education and public outreach through social media is clear to see, even if it is less clear how exactly it will unfold as technology and tastes change. It is even harder to predict how other trends, like the Internet of Things (IoT), will articulate with the goals of archaeology, but certainly as the gap between the digital and modern things becomes smaller, so will the gap between the digital and ancient things (see also Horton, 2014).

5. Geospatial Big Data in action: Maori fortifications (Pa)

Data quality, privacy, and the growing size of our datasets are problems that need to be faced head on or they will have a paralyzing effect on advances in archaeology. I present three small studies on the archaeology of fortifications in New Zealand, called pa by Maori. For these examples I will employ different types of data to examine culture history, evaluate public and professional site records, and create visualizations. For the purposes of this paper I am taking a ‘data quality-in-use’ approach (Merino et al., 2016), meaning that I do not presume to know the adequacy of the existing GBD, or the improvements necessary to achieve the tasks at hand, before the study. Rather, a post hoc assessment is made of the adequacy of the original dataset and improvements in a Standalone Quality Report. I recognize this goes against our instincts as scholars and looks like we are abandoning our core values regarding data quality. To the contrary, what I am advocating is finding a productive way to apply those values to Geospatial Big Data, identify issues in use, and share how they have been dealt with, so we can have better GBD in the future.

Maori fortifications are an example of a topic about which we are data rich and information poor. We know, for example, when they began to be built, we know about how many were built, and we know that many more were built in warmer, coastal environments with good farmland and high population density. It remains unclear, however, if population density was always high in regions with good farmland from first settlement of the islands, or if there were any geographic shifts in where fortifications were built over time.

New Zealand was first settled after 1250 CE through long-distance voyages from Eastern Polynesia (for a recent summary, see Dye, 2015). The first centuries of New Zealand's culture history, referred to as the Early Period (1250-1450 CE), include strong evidence for a highly mobile settlement pattern across the country, but no fortifications (Walter et al., 2010).
There remains no strong evidence for the construction of fortifications until around 1500 CE (Schmidt, 1996; McFadgen et al., 1994), in the Middle Period (1450-1650 CE). The use of fortifications was documented by European visitors during the Late Period (1650-1800 CE) and in the Historical Period (after 1800 CE) when Maori continued to use traditional fortifications with adaptations for the introduction of muskets.

As noted above, we have a good idea of the number, geographic range, and preferred location of fortifications (Fig. 2). The total number of fortifications built by the ancestors of Maori over three centuries has been given in various sources as being between 4000 and 6000 (Davidson, 1984), followed by a more specific site record based figure of 6528 (Schmidt, 1996), and present professional site records give a figure of 7314 (ArchSite, 2017). Fortifications have been recorded across the entirety of New Zealand's two major islands and offshore islands. There is however a well-known preference for northern, warmer, coastal environments, as seen in the site predictive model shown in Fig. 2 (Leathwick, 2000). The North Island, and the northern parts of the South Island, are the only locations suitable for the crops that the ancestors of Maori brought with them to New Zealand, and so while the paleodemography of the islands is currently a matter of speculation, we presume that the agricultural economy in the warmer north allowed for a much faster growth rate than the hunting-fishing economy of the colder south (Davidson, 1984:56-59). And so, the spatial distribution of fortifications is positively correlated with both good farmland and the regions with the highest population density at the time of European contact. In Allen's (2006) summary of research on Maori fortifications he identifies ecological, political, and symbolic perspectives.
The ecological model, originally conceived by Vayda (1960), suggests that seizing cleared gardens from neighboring groups became less difficult than finding and clearing new land as the population grew. The political model interprets the distribution of larger fortifications in ecologically rich areas as reflecting the consolidation of resources and power by chiefdoms (Allen, 1994, 1996, 2008; Earle, 1997) and the symbolic model sees fortifications as symbols in larger Maori cosmology (Barber, 1996).

Fig. 2. General model for the spatial distribution of Maori fortifications in New Zealand. It is well established that the majority of fortifications and other pre-European contact era archaeological sites are found in the warmer northern region of New Zealand, especially in coastal environments, as shown in this example of a site predictive model (Leathwick, 2000:Fig. 3).

5.1. Temporal GBD and the fortification of New Zealand

GBD can be used to create a geospatial-temporal model of the fortification of New Zealand. As noted, the date for the onset of fortifications, around 1500 CE, and the preference for fortification construction in northern, warmer locations better suited for farming have already been determined in previous studies. Here radiocarbon dates from across New Zealand were used to estimate the distribution of evidence for fortification use over time slices to

allow us to determine if geographic preference was evident from the earliest periods, or if it varied over time (see also McFadgen et al., 1994). To approximate the spatial distribution of population - a factor not currently quantified in the study area - the frequency of radiocarbon dates will be used to model population distribution from settlement (1250 CE) through the period of fortification use. The methods applied here are adapted from Chaput et al. (2015).

5.1.1. Sources of data

The primary source of archaeological data for this study is an archived radiocarbon database created by the Waikato Radiocarbon Lab called NZ C14 Data (version 0.5) (http://www.waikato.ac.nz/nzcd/C14kml.kmz) (Fig. 3). Created about 15 years ago to complement an online database of radiocarbon dates (www.waikato.ac.nz), the database includes n = 1671 dates. The database is remarkably rich with over three times the density of the North American coverage in the CARD database, the basis for Chaput et al. (2015), over a much shorter time frame (~700 yrs, as opposed to 12.5 ky). It is in a Google Earth format (kmz), which was still new at the time it was created, and has a clear disclaimer that it comes with no claims of quality. The online database however does have a great deal of relevant metadata.

Fig. 3. Geospatial Database of Radiocarbon Dates in New Zealand. This image shows an archived geospatial database (kmz) created by Waikato Radiocarbon Lab.

Environmental data for this study was sourced from the New Zealand government's Land Information (LINZ) division and was created as part of the Land Environments of New Zealand (LENZ) classification (Fig. 4). This included a polygon layer of the country representing the main islands and over 900 offshore islands that was filtered in this study to include only the four largest islands (North Island, South Island, Stewart Island, Great Barrier Island) for ease of processing. To give a general approximation of the climate at different locations a layer of average modern temperature was used with the caveat that there are known shifts in the climate over the period of human occupation of New Zealand in the pre-European era (1250-1769 CE), notably the Little Ice Age.
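A kmz archive of the kind described above is a zip file that wraps a KML (XML) document, so its point records can be read with standard tools. The sketch below is a minimal illustration and not the workflow used in this study: it assumes each record is a simple Point placemark, the internal layout of the NZ C14 Data file is not described here, and the function names and the sample placemark are hypothetical. Tags are matched by local name because older files may declare a different KML namespace.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an XML tag."""
    return tag.rsplit("}", 1)[-1]


def read_kmz_points(kmz_bytes):
    """Return name, latitude, and longitude for each placemark in a kmz.

    Assumes simple Point placemarks; KML lists coordinates as
    'longitude,latitude[,altitude]'.
    """
    with zipfile.ZipFile(io.BytesIO(kmz_bytes)) as zf:
        kml_name = next(n for n in zf.namelist() if n.lower().endswith(".kml"))
        root = ET.fromstring(zf.read(kml_name))

    points = []
    for element in root.iter():
        if local_name(element.tag) != "Placemark":
            continue
        name, coords = None, None
        for child in element.iter():
            if local_name(child.tag) == "name" and name is None:
                name = (child.text or "").strip()
            elif local_name(child.tag) == "coordinates" and coords is None:
                coords = (child.text or "").strip()
        if coords:
            lon, lat = (float(v) for v in coords.split()[0].split(",")[:2])
            points.append({"name": name, "lat": lat, "lon": lon})
    return points
```

A reader like this recovers only what the placemark stores directly. Attributes held elsewhere, such as in free-text descriptions or a separate online table, would still need to be joined back to each point by an identifier.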

5.1.2. Methods

The archaeological geospatial database (NZ C14 Data) underwent a number of edits, transformations, and filters to generate raster time-slices of population and fortification distribution. These steps are summarized below: 1) Assembling Archived Data, 2) Adding Environmental Variable to Data, 3) Coding Site Type, 4) Coding Temporal Values, 5) Spatial Sampling, and 6) Raster Interpolations.

Assembling Archived Data. The point layer format (kmz) of NZ C14 Data was transferred to ESRI's ArcMap 10.3. Locational information (lat, long) and radiocarbon lab identification transferred easily; however, other data (site type, material dated) did not migrate smoothly. It was necessary to search the online database using radiocarbon lab identification numbers to re-attach this information to records. Once complete, a point shapefile was created in ArcMap that was transformed to a local datum and projection (NZ 2000 Map Grid).

Adding Environmental Variable to Data. To allow the results to be quantified relative to the local climate where radiocarbon dated materials were found, a raster representing mean annual modern temperature was used to add a field (Temperature) to the point record. Some points (n = 199) were outside of the raster and excluded from the analysis. Excluding so many records was not ideal; however, since there did not appear to be any systematic error that caused them to be outside the environmental raster, the analysis moved forward.

Fig. 4. Average Modern Temperature and Distribution of Radiocarbon Dates. Temperature ranges from 6.0 °C (black) to +16.2 °C (white). Sources: LINZ, Waikato Radiocarbon Lab. Data reproduced with the permission of Landcare Research New Zealand Limited.

Coding Site Type. The native classification of radiocarbon dates by site type included 43 recognized types including several types of fortifications. In practice, there were 47 unique site types coded in the dataset, mainly due to typos in the original data. To simplify how sites were coded a new field called “Grouped_Type” was created with 10 options, one for fortifications and the remainder based on broad formal/functional designations.

Coding Temporal Values. The most time consuming aspect of this analysis was filtering, classifying, and coding temporal information. First, the recent and brief period of human occupation of New Zealand means that one must have a protocol for interpreting radiometric results that overlap. In addition, there is the question of inbuilt age in unidentified charcoal (see Dye, 2015 for a recent discussion of this issue). Smith (2010) outlined one such protocol in

which dates were included or excluded based on material type and sorted into Early, Middle, Late, and Historical periods as well as overlapping periods (e.g., Early-Middle, Middle-Late). This yields good results but of course requires a great deal of individual evaluation of dates in terms of acceptability of material and distribution of multiple intercept dates at 1- and 2-sigma. To make it possible to process a large number of dates, rule-based filtering was required. In this case, all material was classified as either identified terrestrial charcoal, marine shell or bone, or other. No attempt was made to filter out long-lived versus short-lived charcoal among the identified charcoal since the way the material was reported was not well-suited for searching and classifying. Second, radiocarbon dates were calibrated using Calib 7.1 (Stuiver et al., 2017). Terrestrial charcoal was calibrated using the Southern Hemisphere curve (SHCal13) and marine material using the recommended marine calibration (Marine13) as described in Smith (2010). Dates with greater than 1000 CRA were filtered out to exclude most dates on pre-human natural phenomena, and dates equal to or less than 120 CRA were filtered out from analysis as being too recent. For this analysis the resulting mean intercept date was used to put dates into century scale periods and these were given cultural period names adapted from Smith (2010) with the addition of “Natural” to account for periods that likely include all or some dates that pre-date human settlement of New Zealand in 1250 CE.

Spatial Sampling. As Chaput et al. (2015) note, over-sampling is a major concern in these types of analysis since there will be some sites, and regions, that have far more dates than others and, without some spatial sampling, would appear more ‘active’ than is warranted. In this case, most sites (~60%) have a single date in any given period, and for these it was straightforward to code them as present/absent for each period. The remaining sites were also coded as present/absent for each time period regardless of the number of dates within that period.

Raster Interpolations. Rasters were created based on the method described in Chaput et al. (2015) using a kerging density function (ArcMap 10.2) with a search radius of 600 km (Figs. 5 and 6). These were normalized, using the Raster Calculator function, to the highest density results, which fall in the Middle Period (mean dates of 1500-1600 AD). The coastline of New Zealand creates an edge effect, complicated by the fact that some dates (n = 19 out of 495) are mis-located off the coast. Therefore, two methods were run: one where there was no spatial constraint on the raster, and then the result was clipped to the island polygon layer, and another where the island polygon layer was used to restrict the raster calculation.
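The rule-based steps above (the CRA filter, century-scale binning, present/absent coding, and normalization to the densest period) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ArcMap workflow used in the study: the record fields, function names, and example values are hypothetical, the handling of years that fall exactly on a century boundary is assumed, and the interpolation itself is not reproduced, only the normalization applied to its output.

```python
from collections import defaultdict

MIN_CRA_EXCLUSIVE = 120   # dates at or below this age are treated as too recent
MAX_CRA_INCLUSIVE = 1000  # older dates are likely pre-human natural events


def keep_date(cra):
    """Rule-based filter on conventional radiocarbon age (CRA)."""
    return MIN_CRA_EXCLUSIVE < cra <= MAX_CRA_INCLUSIVE


def century_label(mean_cal_year):
    """Assign a calendar year to a century-scale bin such as '1500-1600'.

    Putting a year that falls exactly on a boundary into the later bin is
    an assumption; the text does not state how boundaries were handled.
    """
    start = (int(mean_cal_year) // 100) * 100
    return f"{start}-{start + 100}"


def presence_by_period(records):
    """Collapse dates to one presence per site per period."""
    sites = defaultdict(set)
    for rec in records:
        if keep_date(rec["cra"]):
            sites[century_label(rec["mean_cal_year"])].add(rec["site"])
    return {period: sorted(names) for period, names in sites.items()}


def normalize_to_reference(rasters, reference_period):
    """Divide every cell by the maximum cell of the reference period."""
    peak = max(max(row) for row in rasters[reference_period])
    return {
        period: [[cell / peak for cell in row] for row in grid]
        for period, grid in rasters.items()
    }


# Hypothetical records: site names, ages, and calibrated means are made up.
records = [
    {"site": "A", "cra": 450, "mean_cal_year": 1520},
    {"site": "A", "cra": 430, "mean_cal_year": 1560},  # same site and period
    {"site": "B", "cra": 300, "mean_cal_year": 1640},
    {"site": "C", "cra": 1200, "mean_cal_year": 900},  # excluded: too old
    {"site": "D", "cra": 120, "mean_cal_year": 1850},  # excluded: too recent
]
presence = presence_by_period(records)

# Hypothetical 2 x 2 density grids for two periods.
rasters = {
    "1500-1600": [[0.0, 2.0], [4.0, 1.0]],
    "1600-1700": [[1.0, 0.0], [2.0, 3.0]],
}
scaled = normalize_to_reference(rasters, "1500-1600")
```

Coding presence once per site and period means that a site with many dates in one century counts the same as a site with one. Scaling every period by the same peak keeps the time slices comparable with one another, rather than stretching each slice to its own maximum.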

5.1.3. Results

The estimate of fortification and population distribution yields three generalized phases. First, in the period before fortification, the population distribution was remarkably uniform across the islands, with indications that the South Island may have been more heavily used. Second, concurrent with the earliest signs of fortifications, evidence of human activity shifts to the North Island, and the earliest fortifications show a preference for the North Island that is consistent through 1400-1600 CE. Third, in the phase when we see the beginning of sustained contact with Europeans, 1600-1800 CE, activity on the South Island continues to fall, although we do find the first dates of fortifications in the far southern reaches of the South Island. This last trend is represented on both the unrestricted and coast-restricted rasters, although the latter leaves off the southernmost date from a fortification. The geospatial trends identified are also seen in modern mean temperature at the locations where radiocarbon dates were reported. Fig. 7 shows the range of values in different site classifications (pa, non-pa, all dates). There is a consistent preference for warmer climates for fortifications over time, broadening in the final periods, as well as a shift from activity that reflects no preference, or possibly a southern/colder preference, to activity shifting toward the temperature range of fortifications.
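A comparison of temperature by period and site class, like the one described above, amounts to a grouped summary of the Temperature field. The sketch below assumes the summary statistics are the minimum, median, and maximum; the statistics actually plotted in Fig. 7 are not specified in the text, and the field names and values here are invented for illustration and are not the study's data.

```python
from collections import defaultdict
from statistics import median


def temperature_ranges(points):
    """Summarize temperature by (period, site class) as (min, median, max)."""
    groups = defaultdict(list)
    for p in points:
        groups[(p["period"], p["site_class"])].append(p["temperature"])
    return {key: (min(v), median(v), max(v)) for key, v in groups.items()}


# Made-up values for illustration only; these are not the study's data.
points = [
    {"period": "1500-1600", "site_class": "pa", "temperature": 14.0},
    {"period": "1500-1600", "site_class": "pa", "temperature": 15.0},
    {"period": "1500-1600", "site_class": "non-pa", "temperature": 9.0},
    {"period": "1500-1600", "site_class": "non-pa", "temperature": 13.0},
    {"period": "1500-1600", "site_class": "non-pa", "temperature": 11.0},
]
summary = temperature_ranges(points)
```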

5.2. Site records GBD and the distribution of fortifications (Pa)

GBD of fortifications in New Zealand are a good example of the difference between professional (privately maintained and restricted) and publicly available site records. As noted above, in New Zealand the spatial distribution of fortifications is not in dispute, nor is the value of science and scientific facts, and so it is largely unnecessary to make such a comparison. But, given that in the United States at the moment there is a real danger to the authority and value of scientific data, specifically when it is perceived as an impediment to economic development, it is in archaeology's interest to be able to clearly show that professional site records, while often kept from public view to avoid looting, do exist, and that the better-known sites are only the ‘tip of the iceberg’ by comparison.

Fig. 5. Distribution of Radiocarbon Dates: Population. This time series shows all radiocarbon dates as a proxy for spatial distribution of population (after Chaput et al., 2015). See text for description of periods.

Please cite this article in press as: McCoy, M.D., Geospatial Big Data and archaeology: Prospects and problems too great to ignore, Journal of Archaeological Science (2017), http://dx.doi.org/10.1016/j.jas.2017.06.003


Fig. 6. Distribution of Radiocarbon Dates: Fortification. This time series shows radiocarbon dates from fortifications as a proxy for the spatial distribution of fortifications (after Chaput et al., 2015). See text for description of periods.

Fig. 7. Average Temperature at Locations with Radiocarbon Dates. When organized by time series, it appears that fortifications favored warmer areas from their first appearance and grew to include colder regions through time. The general trend for the proxy for population distribution appears to shift toward warmer locations over time with a broad range of environments occupied throughout.

5.2.1. Sources of data

The New Zealand Archaeological Association (NZAA) has been responsible for the systematic documentation of archaeological sites for decades, first as a paper record (also known as the Archaeological Site Recording Scheme), and today as ArchSite (archsite.org.nz). Although the NZAA partners with government agencies, it is an independent charity created “to protect New

Zealand's cultural heritage and to publish, promote and foster research into archaeology.” The site database is also a resource for local tribal cultural resource managers, although Maori continue to be under-represented in archaeology (Rika-Heke, 2010). In the early days of the online database there was some concern from indigenous scholars that it would be a commercial scheme; it nonetheless remains non-profit and governed by a board drawn


from the NZAA membership and a web administrator who approves changes and additions to the database (Bickler, personal communication).

The ArchSite web portal has a public face where one can browse a map of 68,753 archaeological sites across the country. Visitors are blocked from zooming in to a scale that might give away the specific location of sites, and full access to the geospatial database is granted only to members of the NZAA. This is an integrated database with key information on each record's history (older site identification number, current identification number, last time the record was updated, whether its location was migrated from white-paper records or GPS, etc.). It also includes a much cleaner site type classification than was in use when the NZ C14 Data was created, so it is straightforward to identify the 7314 fortifications (pa) across the country.

The public data on fortifications was created by Land Information New Zealand (LINZ) based on government topographic maps (1:150,000 scale) that have been migrated to GIS (Fig. 8). It includes 2135 locations and is available for download at LINZ (linz.govt.nz). It is also currently hosted on OpenStreetMap.org. While it is not connected to the professional database, both list the traditional name of the fortification, if known, and in some cases it would be possible to link them through location.

For this study, it was critical that the locations of sites in ArchSite, the professional archaeological database, were not inadvertently revealed. To do this, a polygon layer representing New Zealand's “Map Grid” system was used to summarize the distribution of sites. Map Grid is already used as the first alphanumeric element in site records in New Zealand (for example, site P05/214 is located in map grid reference location P05). Reference grids vary slightly across the country but generally are about 7.5 km (N-S) x 5.0 km (E-W), covering about 37.5 square kilometers. Also, since grids go

beyond the coastline, no points were left out of the analysis due to their mislocation relative to the coast.

5.2.2. Methods

The archaeological geospatial databases, the professional database (ArchSite) and the public database (LINZ-Pa-pts), underwent a number of transformations and filters to generate vector layers representing fortification distribution. These steps are summarized below: 1) Frequency of All Known Sites (ArchSite); 2) Fortifications from Professional Records (ArchSite); and 3) Counting Fortifications from Public Records (LINZ-Pa-pts).

Frequency of All Known Sites (ArchSite). A transitional shapefile was created in which the frequency of all sites currently recorded in ArchSite within each polygon of the NZ map grid was calculated, as well as the density of site records.

Fortifications from Professional Records (ArchSite). Another transitional shapefile was created in which the frequency of fortification sites currently listed in ArchSite within each polygon of the NZ map grid was calculated, as well as the density of site records.

Counting Fortifications from Public Records (LINZ-Pa-pts). A final shapefile was created in which the frequency of publicly known sites (LINZ, NZ topo maps) within each polygon of the NZ map grid was calculated, as well as the density of site records.

5.2.3. Results

The results show how surprisingly poorly a public geospatial database represents the actual geographic distribution and density of fortifications. Fortifications are present in ~20% of all map grid polygons. To put that in more meaningful terms, at any given place in New Zealand there is a 20% chance that a fortification is within a 1–2 h walk. The public records of fortifications alone would underestimate how common fortifications are across space

Fig. 8. Distribution of Public and Professional Site Records of Fortifications. Note that site locations are masked by using a polygon layer representing the NZ map grid. Sources: LINZ, ArchSite.
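The masking noted in the caption above can be illustrated with a minimal sketch. It assumes, as the text states, that the part of a site number before the slash is the map grid reference and that a grid polygon covers about 37.5 square kilometers. The function names and every site number other than P05/214 are hypothetical, and the study itself worked with shapefiles rather than site number strings.

```python
from collections import Counter

# Approximate area of one map grid polygon, as given in the text (7.5 km x 5.0 km).
GRID_AREA_KM2 = 37.5


def grid_reference(site_id):
    """Return the map grid part of a site number, e.g. 'P05/214' -> 'P05'."""
    return site_id.split("/", 1)[0]


def summarize_by_grid(site_ids):
    """Count sites per grid polygon and convert counts to sites per square km.

    Only the grid reference survives in the output, so exact locations
    are not revealed by the summary.
    """
    counts = Counter(grid_reference(site_id) for site_id in site_ids)
    return {grid: (n, n / GRID_AREA_KM2) for grid, n in counts.items()}


# Hypothetical site numbers; only P05/214 appears in the text.
sites = ["P05/214", "P05/7", "P05/880", "Q06/12"]
summary = summarize_by_grid(sites)

for grid, (count, density) in sorted(summary.items()):
    print(f"{grid}: {count} sites, {density:.3f} per km2")
```

Running the same summary on the professional and the public point layers would give two per-grid tables that can be compared without exposing any coordinates.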


dramatically. There is even less correspondence when we compare the density of fortifications in the public and professional databases. It is exceedingly rare that the two correspond to within 10% of one another; this is true for less than 2% of locations. These same data can be used to infer that a further ~30% of New Zealand has no fortification recorded, although other sites have been recorded in that area, and that an additional half of the country has no archaeological site recorded in the professional site record system.

5.3. Remote sensing GBD and the mapping of fortifications (Pa)

GBD can be used to create 3D models of fortifications in New Zealand, and these are a good example of how embryonic geospatial datasets can improve without necessarily getting ‘big.’ The NZAA's ArchSite database includes links to site maps, and for most fortifications these come from a brief site visit many years ago when a not-to-scale map was drawn of the defenses (ditches, banks) and internal earthworks (terraces, pits). These are extremely valuable as in many cases they are our only field map of the site. Site records are also updated, where possible, with GPS locations of major features, and there has been an effort to include air photos (Jones, 2007). For this use of GBD, I wanted to show how airborne LiDAR from immediately in and around a fortification can also be used to visualize the site but, more importantly, how that can grow the size of a single site record.

5.3.1. Sources of data

Since the purpose of this study is to show how new visualizations of a fortification can grow the size of a geodatabase, I began by selecting a site, Puketona Pa, for which I had created a DEM using airborne LiDAR data several years ago (Fig. 9). Puketona Pa is classed as a ‘Hill Pa’ on an isolated natural hill above the Waitangi River in the Northland District.
The features present (terraces, defensive ditches, pits) are typical of fortifications (pa), and the fortified internal area at the summit covers ~20,000 m², putting it in the large size class (5000–40,000 m²) along with fortifications that “may serve as either internal or external power bases” (Marshall, 2004:77). Oral history describes the occupation of the site in the generations prior to European contact and confirms it was indeed a political center of some importance. The purpose of creating the LiDAR-derived DEM was to have remotely sensed imagery in advance of a site visit. The original source of the airborne LiDAR was a survey funded by the Northland Regional Council in the wake of serious flooding of the Waitangi River in 2007. The DEM was created using the Nearest Neighbor function at a high resolution (0.25 × 0.25 m) in order to define the edges of earthworks and covered an area of about 90 ha. To reduce the size of downstream digital products in this visualization, and since the site occupies a much smaller portion of that area, a selection of 2 ha was used here, reducing the baseline DEM from greater than 60 MB to less than 8 MB.

5.3.2. Methods

The DEM of Puketona Pa underwent a number of transformations to generate layers representing the fortification. These steps are summarized below: 1) Deriving Relief Visualizations; and 2) 3D Model Fly-Through.

Deriving Relief Visualizations. To create a number of different types of visualizations I used the Relief Visualization Toolbox (RVT) (http://iaps.zrc-sazu.si/en/rvt#v) created by Kokalj and the team at the Institute of Anthropological and Spatial Studies (Kokalj et al., 2013). The standalone executable version of the toolkit was used to run, in just over 1 min of processing time, 11 functions: Analytical hillshading (HS), Hillshading from multiple directions (AHS), PCA of hillshading (PCA), Slope gradient (SLOPE), Simple


local relief model (SLRM), Sky-View Factor (SVF), Anisotropic Sky-View Factor (SVF-A), Openness – Positive (OPEN-POS), Openness – Negative (OPEN-NEG), Sky illumination (SIM), and Local dominance (LD). Each function produced two versions (32-bit, 8-bit), for a total of 22 individual rasters. Almost all rasters were useful representations of the site (Fig. 10); only the Sky illumination (SIM) and Local dominance (LD) functions, when applied at RVT's default settings, did not produce rasters with enough variance for visualization.

3D Model Fly-Through. To create an example of a 3D model fly-through of Puketona Pa, I first created a slope layer for the original DEM and then clipped it to a circle (460 m diameter) around the center of the site. The resulting layer was opened in ESRI's ArcScene to create a ‘fly-through’ video (.avi) where the z-dimension was derived from the DEM. The short video (25 s, 1048 × 796) is a brief tour around the site (Fig. 11).

5.3.3. Results

The methods and digital products here are not new to archaeology (in fact, many innovative uses of 3D data have specifically targeted hillforts; for a recent PhD dissertation on the topic, see O'Driscoll, 2016), and it is well established that it “can be beneficial to carry out a more detailed desktop survey using lidar data and other sources, such as standard aerial photographs, and taking this information into the field instead of, or together with, the simple lidar derived imagery” (English Heritage, 2010:38). What might be surprising is that while the files created are a great deal larger than the scans of paper files currently stored on the ArchSite website (an order of magnitude in the case of 32-bit rasters and the video), they are not that large in absolute terms. If we scaled these up to include similar files for all +7000 fortifications, it would be less than a few terabytes, and scaled up again to the +64,000 sites in all of New Zealand, it would fit on a handful of servers.
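The scaling described above can be sketched as a short calculation using the per-site sizes reported in Table 3. The site counts (7314 fortifications, 68,753 sites) are taken from Section 5.2.1, and it is an assumption here that these are the multipliers behind Table 3, and that decimal units (1 GB = 1000 MB) were used. Because the per-site values are rounded to whole megabytes, a few recomputed totals may differ slightly from the published ones.

```python
# Per-site storage in MB for Puketona Pa, as listed in Table 3.
PER_SITE_MB = {
    "Current ArchSite (scans of paper files)": 1,
    "DEM (region, 90 ha)": 63,
    "DEM (large site, 2 ha)": 8,
    "Relief Visualization (32-bit)": 119,
    "Relief Visualization (8-bit)": 25,
    "Fly-Through Video (desktop, 25 s)": 254,
}

N_FORTIFICATIONS = 7314    # fortifications (pa) identified in ArchSite
N_ALL_SITES = 68753        # sites shown on the public ArchSite map
MB_PER_GB = 1000           # assumption: decimal units
MB_PER_TB = 1000 * 1000    # assumption: decimal units


def scale_up(mb_per_site, n_sites, mb_per_unit):
    """Storage needed if every site had a file of the same size."""
    return mb_per_site * n_sites / mb_per_unit


fortifications_gb = {
    name: scale_up(mb, N_FORTIFICATIONS, MB_PER_GB)
    for name, mb in PER_SITE_MB.items()
}
all_sites_tb = {
    name: scale_up(mb, N_ALL_SITES, MB_PER_TB)
    for name, mb in PER_SITE_MB.items()
}

for name in PER_SITE_MB:
    print(f"{name}: {fortifications_gb[name]:.0f} GB, {all_sites_tb[name]:.1f} TB")
```

Under these assumptions the regional DEM, 32-bit relief, and video rows reproduce the published fortification totals of 461, 870, and 1858 GB.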
Having said that, if one automated the production of these visualizations, there would have to be a way to filter out results like the SIM and LD to keep from creating a lot of raster files with no useable image (see Table 3). What this back-of-the-envelope calculation demonstrates is that although LiDAR and other remote sensing data are genuinely unwieldy GBD, the types of digital products that are of direct interest to archaeology need not be. In other words, it is possible to reap the benefit of large volume GBD (LiDAR) without necessarily increasing the volume of archaeological GBD to a point where they become equally unwieldy.

6. Discussion: standalone quality reporting

I have been guided in this paper by three questions: 1) What kinds of geospatial data are available today? 2) How will larger and more accessible geospatial databases shape the near future of archaeology? And, using a case study from New Zealand, 3) What can we do now about our apprehensions regarding data quality, privacy, and volume?

First, we are a field that values locational information, but archaeology is also a profession with an obligation to keep that information away from those who would misuse it. Even with that important caveat, one would think that it would be straightforward to identify large, complex, geospatial datasets on site locations, artifacts, architecture, radiocarbon dates, and so on, given trends like web-based GIS, open access publishing, volunteer geography, and Big Data science. There are indeed some outstanding archaeological geodatabases out there that are a testament to the tenacity, and generosity, of those who create and contribute to them. But, on the whole, these are small oases in what continues to be a ‘data desert’.


Fig. 9. Web Maps with Puketona Pa (Site P05/214). Sources (top to bottom): OpenStreetMap, Google Earth, Topomapnz, Archsite (public view).


Fig. 10. Screen Capture from Video Fly Through of 3D Model of Puketona Pa. Areas of high slope (white) contrast with flat areas (black).

Fig. 11. Terrain Relief Visualizations of Puketona Pa. Clockwise from top left: slope (64 to 1), multi-hs, pca, slrm, open-neg, open-pos (150–58), svf-a (1–0.4), svf (255–0), and field map from site record (not-to-scale).
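The filtering step suggested in Section 5.3.3 for automated production, dropping rasters such as SIM and LD that show too little variance to be useful, could be sketched as follows. This is only an illustration: the variance threshold, the function name, and the tiny grids standing in for 8-bit rasters are hypothetical, and a real workflow would read raster files with a GIS library rather than nested lists.

```python
from statistics import pvariance


def usable_rasters(rasters, min_variance):
    """Return the names of rasters whose cell values vary by at least min_variance."""
    keep = []
    for name, grid in rasters.items():
        # Flatten the grid so the variance is computed over all cells.
        cells = [value for row in grid for value in row]
        if pvariance(cells) >= min_variance:
            keep.append(name)
    return keep


# Hypothetical 2 x 2 grids: one with clear contrast, one nearly uniform.
rasters = {
    "SLOPE": [[0, 64], [10, 30]],
    "SIM": [[128, 128], [128, 129]],
}
kept = usable_rasters(rasters, min_variance=1.0)
print(kept)
```

A single global threshold is the simplest possible choice; in practice the cut-off would likely need to be set per visualization type, since each function has its own value range.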

Second, what happens when our ‘data desert’ turns into a ‘data ocean’ is largely up to us and how we deal with the datasets we have and collect today. As spatial technology has evolved and become integrated into the practice of archaeology, we are dealing with challenges posed by the sheer size and complexity of data we use and produce. Field survey and excavations regularly yield far more pieces of spatial information than ever before. At the same time, the available satellite imagery, airborne lidar, and

other remote sensing and environmental datasets increase in size and complexity. In this context, sharing data will become an ever more delicate balance between privacy, quality, and the potential benefits of being able to show the world what we have found. We can see this especially in how we construct culture histories, how we make decisions using archaeological geospatial data, and how we visualize archaeology for research and for the public. Third, the case study I presented here is a good example of how


Table 3. Estimated Storage Size of Visualizations. The digital products from this study for Puketona Pa would take the current 1 MB of files associated with the site up to 469 MB.

| Digital product | Puketona Pa Data Stored (MB) | Estimated Total for All Fortifications (GB) | Estimated Total for All Sites (TB) |
|---|---|---|---|
| Current ArchSite (scans of paper files) | 1 | 9 | 0.1 |
| DEM (region, 90 ha) | 63 | 461 | 4 |
| DEM (large site, 2 ha) | 8 | 56 | 1 |
| Relief Visualization (32-bit) | 119 | 870 | 8 |
| Relief Visualization (8-bit) | 25 | 186 | 2 |
| Fly-Through Video (desktop, 25 s) | 254 | 1858 | 17 |

Table 4. Suggested Questions for Standalone Quality Reports on the Use of Geospatial Big Data in Archaeology. The specific data qualities listed here were modified from Merino et al. (2016). This type of document is in line with the London Charter (Denard, 2012) principle of documentation for computer-based visualization of cultural heritage.

| Category | Qualities | Adequacy Question | Improvement Question |
|---|---|---|---|
| Contextual | relevant and complete | What is the intellectual justification for using this data for the task? | What changes from the primary data were made, if any, to assure the data was relevant and complete? |
| Contextual | unique and semantically interoperable | Are there known duplicate or semantically redundant records in the new data? | What changes were made, if any, to eliminate duplicate or semantically redundant records? |
| Contextual | semantically accurate | How does the data represent real entities? | What changes were made, if any, to how the data represent real entities? |
| Contextual | credible | What criteria were used to assess the credibility of data relative to the task? | What data, if any, was filtered due to a question of credibility? |
| Contextual | confidential | Was the primary data accessible to the analyst? | What has been done to protect the primary data from inappropriate use? |
| Contextual | compliant | Does the data meet regulations and standards? | Does the newly derived data meet relevant regulations and standards? |
| Temporal | time-concurrent | How is the data grouped by time periods? | What changes, if any, were made to how data is grouped by time periods? |
| Temporal | current | Is original data an archived (static) or integrative database (updated)? | How does the database identify when data was collected? |
| Temporal | timely | Are there known data from ongoing projects or other new data not included in this database? | Has the primary data been updated/modified for the task at hand? |
| Temporal | frequent | Is space-time trend analysis possible? | Have changes been made to make space-time trend analysis possible? |
| Temporal | time-consistent | Is time represented coherently? | Were changes made to make time represented coherently? |
| Operational | available, recoverable, accessible | What logistical barriers were there to gaining access to the data? | How, if at all, did logistical barriers to accessing primary data shape this new data? |
| Operational | authorized | What restrictions have the stewards of the original data placed on the data? | What restrictions are there on this new data for secondary use? |
| Operational | similar data types, precision, portable | Are there technical barriers to working with the data? | What improvements were made to overcome technical barriers to working with the data? |
| Operational | efficient | Is the native data model appropriate to the intended task? | Was the native data model changed for the intended task? |
| Operational | traceable | How can one trace access and changes to the data? | Were previous changes to primary data known to the creators of this new dataset? |

we might take a data quality-in-use approach to GBD in archaeology. In the first example, I used an archived geodatabase of radiocarbon dates to estimate the spread of populations and fortifications in New Zealand. In the analysis I discovered a number of data quality problems: a lack of accurate site location data meant many radiocarbon dates could not be compared with environmental data; site types used in the database were not ideal for the project and contained typos; calibration and assigning radiocarbon dates to century-scale temporal bins raised a number of issues and required lowering quality protocols; and interpolation was influenced by an edge effect along the coast that again was complicated by a lack of site location accuracy. As described above, some of these problems were solved by filtering problematic data out of the analysis and/or improving the dataset by creating new fields, such as the grouped site type and time periods. The second example also demonstrates how the need to keep archaeological site locations confidential complicates analysis. In that case I lowered the resolution of the information on the two site databases in the digital end products (i.e., maps, data) by summarizing them using an arbitrary grid (NZ Map Grid). I view that as an improvement in that

it makes it possible to share the results without putting archaeological sites in greater danger of looting. In my last example, I created a number of digital representations of a single site (Puketona Pa) that are ultimately derived from airborne LiDAR. I found that while the underlying LiDAR data is large, including visualizations based on relief and video improved the site record without necessarily creating an insurmountable problem with data volume.

This information on the adequacy of the datasets used, and the improvements made, is absolutely necessary for anyone who would use the original dataset or the digital products of the research presented here. To that end, I suggest we need to require that new analyses using GBD come with a Standalone Quality Report. The International Organization for Standardization (ISO) reference on geographic information data quality (ISO 19157:2013(E)) defines a standalone quality report as a “free text document providing fully detailed information about data quality … evaluations, results and measures used.” In the case of archaeology, a data quality report would need to describe information about the original archaeological geospatial dataset and the data derived for a specific task such as


archaeological research, assessment, documentation, or visual representation. It would need to be supplementary to administrative and technical information contained in metadata; in technical terms the report would be paradata (data published alongside data). I would find narrative descriptions of the adequacy of existing data and improvements useful in a Standalone Quality Report, but we would also do well to include some standardized questions that will make future searches easier (Kintigh, 2015). Table 4 gives an example of the types of questions that might be included in a Standalone Quality Report. One set examines the initial adequacy of the existing geodatabase for the task described; the second asks about the improvements made, if any, for the study. The quality categories are adapted from suggested evaluations of Big Data in terms of contextual, temporal, and operational adequacy and improvements (Merino et al., 2016). Interestingly, Big Data science has become increasingly concerned with how it performs in the temporal dimension (e.g., when does the data pertain to, when was it recorded) in part because of the volume of streaming data.

I will be the first to say that the data quality-in-use approach sounds a great deal like opening the door to messy research done with big, bad data, and this is why the Standalone Quality Report is so important. We need to use our core disciplinary values regarding data quality and privacy and apply them to GBD. A quality report will increase the visibility of our siloed datasets and talk about why that information is siloed in ways that make it more visible and resilient to being lost to those who would deny scientific data.
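To make the idea of standardized questions concrete, the structure of Table 4 could be captured in a small record type and rendered back into the free text document that the ISO definition calls for. This is a hedged sketch, not a proposed standard: the class name, field names, and rendering format are hypothetical, and the example answers are paraphrased from the case study above.

```python
from dataclasses import dataclass


@dataclass
class QualityItem:
    """One row of a Standalone Quality Report, following the layout of Table 4."""
    category: str              # "Contextual", "Temporal", or "Operational"
    quality: str
    adequacy_question: str
    adequacy_answer: str
    improvement_question: str
    improvement_answer: str


def render_report(title, items):
    """Render the items as a plain-text report, three lines per item."""
    lines = [title]
    for item in items:
        lines.append(f"[{item.category}] {item.quality}")
        lines.append(f"  Adequacy: {item.adequacy_question} {item.adequacy_answer}")
        lines.append(f"  Improvement: {item.improvement_question} {item.improvement_answer}")
    return "\n".join(lines)


items = [
    QualityItem(
        category="Contextual",
        quality="confidential",
        adequacy_question="Was the primary data accessible to the analyst?",
        adequacy_answer="Full access is restricted to NZAA members.",
        improvement_question="What has been done to protect the primary data from inappropriate use?",
        improvement_answer="Site locations were summarized by NZ Map Grid polygon.",
    ),
    QualityItem(
        category="Temporal",
        quality="time-concurrent",
        adequacy_question="How is the data grouped by time periods?",
        adequacy_answer="Dates were assigned to century-scale periods.",
        improvement_question="What changes, if any, were made to how data is grouped by time periods?",
        improvement_answer="A grouped time period field was added.",
    ),
]

report = render_report("Standalone Quality Report (example)", items)
print(report)
```

Keeping the questions as fixed fields while leaving the answers as free text is one way to support both narrative description and later searching.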
Equally importantly, if we begin to publish our data with standalone quality reports where we acknowledge the improvements we are making, we are encouraging a professional culture where we work cumulatively, improving upon geospatial data, rather than creating datasets that are used for one task and then abandoned.

7. Conclusions

I describe geospatial datasets in archaeology today as oases in a ‘data desert’ that will one day become a ‘data ocean’ as we create more, and larger, datasets that are more easily discoverable and accessible. The most immediate payoff of this trend will be in terms of having access to all available relevant data. Due to natural taphonomic processes and modern development, the archaeological record is by definition an incomplete set of material evidence; evidence we destroy by excavating. Not only is it in our interest to know what information already exists that might answer our questions, without it we are more likely to needlessly destroy sites in pursuit of redundant data.

The question of how to deal with GBD touches all our ethical obligations (e.g., SAA principles: stewardship, accountability, commercialization, public education and outreach, intellectual property, public reporting and publication, records and preservation, training and resources). We are seeing progress on this front in the success of archival databases (ADS, tDAR) and the promotion of best practices through the responsible use of available integrated databases. As these evolve, it would be excellent to see geoportal data platforms that connect related databases; that is, healthy data silos that are visible, but not fully accessible.

The development of GBD in archaeology offers new ways to engage with other scientists, stakeholder communities, and the public.
I have no doubt that GBD will be the basis for writing culture histories in the foreseeable future, and it will be the way forward for interdisciplinary research as other fields grow to discover all that we have already learned. These same efforts will reveal more ways to involve the public as participants in discovery and stewardship. These benefits (data access, ethics, and broader engagement)


to a skeptical ear will sound like the same platitudes that are dragged out every time archaeology discovers a new digital toy to play with. But I would argue that there are good reasons to think that these benefits will materialize. My optimism comes from examples of GBD that already exist and the parallel trends that are directly or indirectly dealing with GBD, in terms of cyberinfrastructure (Yang et al., 2010) and spatial analyses (Ortman et al., 2015).

We must get on with the task of using massive amounts of descriptive geospatial data in a fashion that is scientific (testable, replicable, etc.), authentic (a faithful representation of our observations on the archaeological record and the human past), and ethical (protection of cultural resources). Information on the adequacy of the datasets used, and improvements made, is absolutely necessary if we are to deal with apprehensions over data quality, privacy, and volume. I suggest we need to require that new analyses using GBD come with a Standalone Quality Report to go beyond the usual administrative metadata. This small change is something that can be implemented now to the benefit of a large and growing body of knowledge about the human past.

Acknowledgements

Thanks to Marieka Brouwer-Burg and Meghan Howey for their invitation to the SAA session and for organizing this special issue. This paper has evolved thanks to lively discussions with my colleagues in Southern Methodist University's Big Data Working Group and a grant from the Maguire Ethics Center. Special thanks to all my colleagues and students who have helped me form the opinions expressed here: Michael Aiuvalasit, Nick Belluzzo, Simon Bickler, Emma Brooks, Steve Burrow, Jesse Casana, Maria Codlin, Ann Horsburgh, Ian Jorgeson, James Flexner, David Meltzer, Andrew Martindale, Stace Maples, Thegn Ladefoged, Cliff Patterson, Leslie Reeder-Myers, Nico Tripcevich, Robert Wayumba, and Joshua Wells.
Trying to capture the evolution of spatial technology in archaeology at any one time is a daunting task and I am grateful to three anonymous reviewers who helped guide my search and pressed me to look forward to the future.

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.jas.2017.06.003.

References

Allen, M.W., 1994. Warfare and Economic Power in Simple Chiefdoms: the Development of Fortified Villages and Polities in Mid-Hawke’s Bay, New Zealand. Department of Anthropology, UCLA, University Microfilms, Ann Arbor. Unpublished PhD. dissertation.
Allen, M.W., 1996. Pathways to economic power in Maori chiefdoms: ecology and warfare in prehistoric Hawke's Bay. Res. Econ. Anthropol. 17, 171–225.
Allen, M.W., 2006. Transformations in Maori warfare: Toa, pa, and pu. In: Arkush, E.N., Allen, M.W. (Eds.), The Archaeology of Warfare: Prehistories of Raiding and Conquest. University Press of Florida, Gainesville, pp. 184–213.
Allen, M.W., 2008. Hillforts and the cycling of Maori chiefdoms: Do good fences make good neighbors? In: Railey, J.A., Reycraft, R.M. (Eds.), Global Perspectives on the Collapse of Complex Systems. Maxwell Museum of Anthropology (Anthropological Papers No. 8), Albuquerque, pp. 65–81.
Attenbrow, V., Hiscock, P., 2015. Dates and demography: are radiometric dates a robust proxy for long-term prehistoric demographic change? Archaeol. Ocean. 50, 29–35.
Bevan, A., 2015. The data deluge. Antiquity 89 (345), 1473–1484.
Barber, I.G., 1996. Loss, change, and monumental landscaping: Towards a new interpretation of the “classic” Maori emergence. Curr. Anthropol. 37 (5), 868–880.
Barton, M., 2013. Stories of the past or science of the future? Archaeology and computational social science. In: Bevan, A., Lake, M. (Eds.), Computational Approaches to Archaeological Spaces. Left Coast Press, Walnut Creek, pp. 151–178.


Barton, M.C., Ullah, I., Mitasova, H., 2010. Computational modeling and Neolithic socioecological dynamics: a case study from Southwest Asia. Am. Antiq. 75 (2), 364–386.
Bamforth, D.B., Grund, B., 2012. Radiocarbon calibration curves, summed probability distributions, and early Paleoindian population trends in North America. J. Archaeol. Sci. 39, 1768–1774.
Bodenhamer, D.J., Corrigan, J., Harris, T.M., 2010. The Spatial Humanities: GIS and the Future of Humanities Scholarship. Indiana University Press, Bloomington.
Bonacchi, C., Bevan, A., Pett, D., Keinan-Schoonbaert, A., 2015. Crowd- and community-fuelled archaeological research. Early results from the MicroPasts project. In: Proceedings of the Conference Computer Applications and Quantitative Methods in Archaeology, pp. 279–288.
Casana, J., 2015. Satellite imagery-based analysis of archaeological looting in Syria. Near East. Archaeol. 78 (3), 142–152.
Contreras, D.A., Brodie, N., 2010. The utility of publicly-available satellite imagery for investigating looting of archaeological sites in Jordan. J. Field Archaeol. 35 (1), 101–114.
Chaput, M.A., Kriesche, B., Betts, M., Martindale, A., Kulik, R., Schmidt, V., Gajewski, K., 2015. Spatiotemporal distribution of Holocene populations in North America. Proc. Natl. Acad. Sci. U. S. A. 112, 12127–12132.
Clark, G.R., Reepmeyer, C., Melekiola, N., Woodhead, J., Dickinson, W.R., Martinsson-Wallin, H., 2014. Stone tools from the ancient Tongan state reveal prehistoric interaction centers in the Central Pacific. Proc. Natl. Acad. Sci. U. S. A. 111, 10491–10496.
Contreras, D.A., Meadows, J., 2014. Summed radiocarbon calibrations as a population proxy: A critical evaluation using a realistic simulation approach. J. Archaeol. Sci. 52, 591–608.
Colwell, C., 2016. How the archaeological review behind the Dakota Access Pipeline went wrong. The Conversation. https://theconversation.com/how-the-archaeological-review-behind-the-dakota-access-pipeline-went-wrong-67815.
Cooper, A., Green, C., 2015. Embracing the complexities of ‘Big Data’ in archaeology: The case of the English Landscape and Identities Project. J. Archaeol. Method & Theory 23, 271–304.
Crombé, P., Robinson, E., 2014. 14C dates as demographic proxies in Neolithisation models of northwestern Europe: A critical assessment using Belgium and northeast France as a case-study. J. Archaeol. Sci. 52, 558–566.
Dakota Access, LLC, 2016. Dakota Access Pipeline Project Section 408 Consent for Crossing Federally Authorized Projects and Federal Flowage Easements. Report prepared for U.S. Army Corps of Engineers. https://assets.documentcloud.org/documents/3036302/DAPLSTLFINALEAandSIGNEDFONSI-3Aug2016.pdf.
Davidson, J.M., 1984. The Prehistory of New Zealand. Longman Paul, Auckland.
Delgado, M., Aceituno, F.J., Barrientos, G., 2015. 14C and the early colonization of Northwest South America: A critical assessment. Quat. Int. 363, 55–64.
Dobbs, G.R., Louis, R.P., 2015. Geospatial technologies and indigenous communities engagement. Int. J. Appl. Geospat. Res. 6 (1), iv–xiii.
Dörter, G., Davis, L., 2013. Bridging geographic information systems (GIS) into the museum world. Digit. Herit. Int. http://dx.doi.org/10.1109/DigitalHeritage.2013.6743843.
Denard, H., 2012. A New Introduction to the London Charter. In: Bentkowska-Kafel, A., Baker, D., Denard, H. (Eds.), Paradata and Transparency in Virtual Heritage. Digital Research in the Arts and Humanities Series. Ashgate, pp. 57–71.
Dye, T.S., 2015. Dating human dispersal in Remote Oceania: a Bayesian view from Hawaii. World Archaeol. 47, 661–676.
Earle, T., 1997. How Chiefs Come to Power: The Political Economy in Prehistory. Stanford University Press, Stanford.
English Heritage, 2010. The Light Fantastic: Using airborne lidar in archaeological survey. https://content.historicengland.org.uk/images-books/publications/light-fantastic/light-fantastic.pdf/.
Field, J.S., Petraglia, M., Lahr, M.M., 2007. The southern dispersal hypothesis and the South Asian archaeological record: Examination of dispersal routes through GIS analysis. J. Anthropol. Archaeol. 26, 88–108.
Gaffney, V., van Leusen, M., 1995. Postscript – GIS, environmental determinism and archaeology: a parallel text. In: Lock, G.R., Stancic, Z. (Eds.), Archaeology and Geographic Information Systems: a European Perspective. Taylor & Francis, London, pp. 367–382.
Gifford-Gonzalez, D., 2016. Letter to Lieutenant General Todd Semonite, Commanding General and Chief of Engineers. 13 September 2016. http://www.saa.org/Portals/0/SAA/GovernmentAffairs/DAPL_LETTER.pdf.
Goldberg, A., Mychajliw, A.M., Hadly, E.A., 2016. Post-invasion demography of prehistoric humans in South America. Nature 532, 232–235.
Goodchild, M.F., 2007. Citizens as sensors: the world of volunteered geography. GeoJournal 69 (4), 211–221.
Green, R.C., 1964. Carbon-14 dating. Curr. Anthropol. 5 (5), 428–429.
Hitchcock, A., 2006. FOIA and protecting cultural resources. In: Harmon, D. (Ed.), People, Places, and Parks: Proceedings of the 2005 George Wright Society Conference on Parks, Protected Areas, and Cultural Sites. The George Wright Society, Hancock, Michigan, pp. 468–474.
Horsburgh, K.A., Orton, J., Klein, R.G., 2016. Beware the Springbok in Sheep's Clothing: How Secure Are the Faunal Identifications upon Which We Build Our Models? Afr. Archaeol. Rev. 1–9.
Horton, M., 2014. Join an Archaeological Dig... Courtesy of the Internet of Things. Huffpost Tech (updated 16 Nov 2014). http://www.huffingtonpost.co.uk/markhorton/join-an-archaeological-di_b_5827698.html.
Hosner, D., Wagner, M., Tarasov, P.E., Chen, X., Leipe, C., 2016. Spatiotemporal distribution patterns of archaeological sites in China during the Neolithic and Bronze Age: An overview. Holocene 26 (10), 1576–1593.
Huggett, J., 2014. Promise and paradox: accessing open data in archaeology. In: Mills, C., Pidd, M., Ward, E. (Eds.), Proceedings of the Digital Humanities Congress 2012. Series: Studies in the Digital Humanities. HRI Online Publications, Sheffield.
Huggett, J., 2015a. Digital haystacks: open data and the transformation of archaeological knowledge. In: Wilson, A.T., Edwards, B. (Eds.), Open Source Archaeology: Ethics and Practice. De Gruyter Open, pp. 6–29. http://dx.doi.org/10.1515/9783110440171-003. ISBN 9783110440171.
Huggett, J., 2015b. A manifesto for an introspective digital archaeology. Open Archaeol. 1 (1), 86–95.
Huggett, J., 2016. Biggish Data. Introspective Digital Archaeology Blog. https://introspectivedigitalarchaeology.wordpress.com/2016/05/20/biggish-data/.
International Organization for Standardization, 2013. Geographic Information – Data Quality (Geneva, Switzerland). https://www.iso.org.
Jelinek, A.J., 1962. An index of radiocarbon dates associated with cultural materials. Curr. Anthropol. 3 (5), 451–477.
Jones, K.L., 2007. The Penguin Field Guide to New Zealand Archaeology. Penguin, Auckland.
Karmas, A., Tzotsos, A., Karantzalos, K., 2016. Geospatial Big Data for Environmental and Agricultural Applications. In: Yu, S., Guo, S. (Eds.), Big Data Concepts, Theories, and Applications. Springer, New York.
Kansa, E.C., Kansa, S.W., Arbuckle, B., 2014. Publishing and pushing: mixing models for communicating research data in archaeology. Int. J. Digital Curation 9 (1), 57–70.
Kintigh, K., 2006. The Promise and Challenge of Archaeological Data Integration. Am. Antiq. 71 (3), 567–578.
Kintigh, K.W., 2015. Extracting information from archaeological texts. Open Archaeol. 1, 96–101.
Kintigh, K.W., Altschul, J.H., Beaudry, M.C., Drennan, R.D., Kinzig, A.P., Kohler, T.A., Limp, W.F., Maschner, H.D.G., Michener, W.K., Pauketat, T.R., Peregrine, P., Sabloff, J.A., Wilkinson, T.J., Wright, H.T., Zeder, M.A., 2014. Grand challenges for archaeology. Am. Antiq. 79 (1), 5–24.
Kokalj, Ž., Zakšek, K., Oštir, K., 2013. Visualizations of Lidar Derived Relief Models. In: Opitz, R., Cowley, D.C. (Eds.), Interpreting Archaeological Topography – Airborne Laser Scanning, Aerial Photographs and Ground Observation. Oxbow Books, Oxford, pp. 100–114.
Laney, D., 2001. 3D Data Management: Controlling data volume, velocity, and variety. Application Delivery Strategies. https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
Lawrence, D., Philip, G., Hunt, H., Snape-Kennedy, L., Wilkinson, T.J., 2016. Long Term Population, City Size and Climate Trends in the Fertile Crescent: A First Approximation. PLoS ONE 11 (3), e0152563. http://dx.doi.org/10.1371/journal.pone.0152563.
Leathwick, J.R., 2000. Predictive Models of Archaeological Site Distributions in New Zealand. Science and Research Internal Report 181. Department of Conservation, Wellington.
Lee, J., Kang, M., 2015. Geospatial Big Data: Challenges and opportunities. Big Data Res. 2, 74–81.
Lemmen, C., 2012. Different mechanisms shaped the transition to farming in Europe and the North American Woodland. Archaeol. Ethnol. Anthropol. Eurasia 41 (3), 48–58.
Li, S., Dragicevic, S., Castro, F.A., Sester, M., Winter, S., Coltekin, A., Pettit, C., Jiang, B., Haworth, J., Stein, A., Cheng, T., 2016. Geospatial big data handling theory and methods: A review and research challenges. ISPRS J. Photogram. Remote Sens. 115, 119–133.
Long, T., Taylor, D., 2015. A revised chronology for the archaeology of the lower Yangtze, China, based on Bayesian statistical modeling. J. Archaeol. Sci. 63, 115–121.
Maaten, L. van der, Boon, P., Lange, G., Paijmans, H., Postma, E., 2007. Computer Vision and Machine Learning for Archaeology. In: Clark, J.T., Hagemeister, E.M. (Eds.), Digital Discovery. Exploring New Frontiers in Human Heritage. CAA2006. Computer Applications and Quantitative Methods in Archaeology. Proceedings of the 34th Conference, Fargo, United States, April 2006. Archaeolingua, Budapest, CD-ROM, pp. 476–482.
Marshall, Y., 2004. Social organization. In: Furey, L., Holdaway, S. (Eds.), Change through Time: 50 Years of New Zealand Archaeology, vol. 26. New Zealand Archaeological Association Monograph, pp. 55–84.
Martindale, A., Morlan, R., Betts, M., Blake, M., Gajewski, K., Chaput, M., Mason, A., Vermeersch, P., 2016. Canadian Archaeological Radiocarbon Database (CARD 2.1) (Accessed 7 November 2016).
McCoy, M.D., Ladefoged, T.N., 2009. New developments in the use of spatial technology in archaeology. J. Archaeol. Res. 17 (3), 263–295.
McFadgen, B.G., Knox, F.B., Cole, T.R.L., 1994. Radiocarbon calibration curve variations and their implications for the interpretation of New Zealand prehistory. Radiocarbon 36, 221–236.
McLaughlin, T.R., Whitehouse, N.J., Schulting, R.J., McClatchie, M., Barratt, P., Bogaard, A., 2016. The Changing Face of Neolithic and Bronze Age Ireland: A Big Data Approach to the Settlement and Burial Records. J. World Prehist. 29, 117–153.
Merino, J., Caballero, I., Rivas, B., Serrano, M., Piattini, M., 2016. A data quality in use model for big data. Future Gener. Comput. Syst. 63, 123–130.
Miller, D.S., 2016. Modeling Clovis landscape use and recovery bias in the Southeastern United States using the Paleoindian Database of the Americas (PIDBA). Am. Antiq. 81 (4), 697–716.
Mills, B.J., Clark, J.J., Peeples, M.A., Haas, W.R., Roberts, J.R., Hill, J.B., Huntley, D.L., Borck, L., Breiger, R.L., Clauset, A., Shackley, M.S., 2013. Transformation of social networks in the late pre-Hispanic US Southwest. Proc. Natl. Acad. Sci. U. S. A. 110 (15), 5785–5790.
Mulrooney, M.A., 2013. An island-wide assessment of the chronology of settlement and land use on Rapa Nui (Easter Island) based on radiocarbon data. J. Archaeol. Sci. 40, 4377–4399.
O'Driscoll, J., 2016. The Baltinglass Landscape and the Hillforts of Bronze Age Ireland. PhD Thesis. University College Cork.
Ortman, S.G., Cabaniss, A.H.F., Sturm, J.O., Bettencourt, L.M.A., 2015. Settlement scaling and increasing returns in ancient society. Sci. Adv. 1, e1400066.
Peros, M., Munoz, S., Gajewski, K., Viau, A., 2010. Prehistoric demography of North America inferred from radiocarbon data. J. Archaeol. Sci. 37, 656–664.
Rika-Heke, M., 2010. Archaeology and indigeneity in Aotearoa/New Zealand: Why do Maori not engage with archaeology? In: Phillips, C., Allen, H. (Eds.), Bridging the Divide: Indigenous Communities and Archaeology into the 21st Century. Left Coast Press, Walnut Creek, CA, pp. 197–212.
Russell, T., Silva, F., Steele, J., 2014. Modelling the Spread of Farming in the Bantu-Speaking Regions of Africa: An Archaeology-Based Phylogeography. PLoS ONE 9 (1), e87854. http://dx.doi.org/10.1371/journal.pone.0087854.
Shennan, S., Downey, S.S., Timpson, A., Edinborough, K., Colledge, S., Kerig, T., et al., 2013. Regional population collapse followed initial agriculture booms in mid-Holocene Europe. Nat. Commun. 4, 2486. http://dx.doi.org/10.1038/ncomms3486.
Schmidt, M., 1996. The commencement of pa construction in New Zealand prehistory. J. Polyn. Soc. 105 (4), 441–451.
Silva, F., Stevens, C.J., Weisskopf, A., Castillo, C., Qin, L., Bevan, A., et al., 2015. Modelling the Geographical Origin of Rice Cultivation in Asia Using the Rice Archaeological Database. PLoS ONE 10 (9), e0137024. http://dx.doi.org/10.1371/journal.pone.0137024.
Silva, F., Steele, J., 2014. New methods for reconstructing geographical effects on dispersal rates and routes from large-scale radiocarbon databases. J. Archaeol. Sci. 52, 609–620. http://dx.doi.org/10.1016/j.jas.2014.04.021.
Steele, J., 2010. Radiocarbon dates as data: quantitative strategies for estimating colonization front speeds and event densities. J. Archaeol. Sci. 37 (8), 2017–2030. http://dx.doi.org/10.1016/j.jas.2010.03.007.
Snow, D.R., Gahegan, M., Giles, C.L., Hirth, K.G., Milner, G.R., Mitra, P., Wang, J.Z., 2006. Cybertools and Archaeology. Science 311, 958–959.
South, S., 1977. Method and Theory in Historical Archaeology. Academic Press.
Smith, I., 2010. Protocols for organizing radiocarbon dated assemblages from New Zealand archaeological sites for comparative analysis. J. Pac. Archaeol. 1 (2), 184–187.
Spaulding, A.C., 1960. The dimensions of archaeology. In: Dole, G.E., Carneiro, R.L. (Eds.), Essays in the Science of Culture in Honor of Leslie A. White. Crowell, New York, pp. 437–456.
Stone, E.C., 2008. Patterns of looting in Iraq. Antiquity 82, 125–138.
Stone, E.C., 2015. An update on the looting of archaeological sites in Iraq. Near East. Archaeol. 78 (3), 178–186.
Stuiver, M., Reimer, P.J., Reimer, R.W., 2017. CALIB 7.1 [WWW program] at http://calib.org (Accessed 26 May 2017).
Suthaharan, S., 2014. Big data classification: problems and challenges in network intrusion prediction with machine learning. Perform. Eval. Rev. 41 (4), 70–73.
Vayda, A.P., 1960. Maori Warfare: Polynesian Society Monographs, vol. 2. A.H. and A.W. Reed, Auckland.
Walter, R., Jacomb, C., Bowron-Muth, S., 2010. Colonisation, mobility and exchange in New Zealand prehistory. Antiquity 84, 497–513.
Williams, A.N., Ulm, S., Smith, M., Reid, J., 2014. AustArch: A Database of 14C and Non-14C Ages from Archaeological Sites in Australia - Composition, Compilation and Review (Data Paper). Internet Archaeol. 36. http://dx.doi.org/10.11141/ia.36.6.
Wilmshurst, J.M., Hunt, T.L., Lipo, C.P., Anderson, A.J., 2011. High-precision radiocarbon dating shows recent and rapid initial human colonization of East Polynesia. Proc. Natl. Acad. Sci. U. S. A. 108, 1815–1820.
Wiseman, J., El-Baz, F., 2007. Remote Sensing in Archaeology. Springer-Verlag, New York.
Yang, C., Raskin, R., Goodchild, M., Gahegan, M., 2010. Geospatial cyberinfrastructure: Past, present and future. Comput. Environ. Urban Syst. 34, 264–277.
Zubimendi, M.A., Ambrustolo, P., Zilio, L., Castro, A., 2015. Continuity and discontinuity in the human use of the north coast of Santa Cruz (Patagonia Argentina) through its radiocarbon record. Quat. Int. 356, 127–146.

Web and Data References

Site Databases, Atlases, & Archives
Archaeology Data Service http://archaeologydataservice.ac.uk/archives/view/c14_cba/
The Digital Archaeological Record (tDAR) http://core.tdar.org
ArchSite (New Zealand Archaeological Association’s site recording scheme) http://www.archsite.org.nz
Digital Index of North American Archaeology (DINAA) http://ux.opencontext.org/archaeology-site-data/
The Electronic Atlas of Ancient Maya Sites: a Geographic Information System (GIS) http://mayagis.smv.org
Digital Archaeological Archive of Comparative Slavery http://www.daacs.org/
Open Context https://opencontext.org/
CORONA Atlas of the Middle East http://digitalhumanities.dartmouth.edu/projects/the-corona-atlas-project/
Antiquity À-la-carte http://awmc.unc.edu/wordpress/alacarte/
Archaeological Institute of America’s Archaeology of North America https://www.archaeological.org/news/aianews/6871
Paleoindian Database of the Americas (PIDBA) http://pidba.utk.edu/
English Landscapes and Identities Project https://englaid.com/er4
Comparative Archaeology Database (University of Pittsburgh) http://www.cadb.pitt.edu/
US National Register of Historic Places (National Park Service) https://www.nps.gov/maps/full.html?mapId=7ad17cc9-b808-4ff8-a2f9-a99909164466
Field Acquired Information Management System (FAIMS) https://www.fedarch.org/
European Union’s Inspire Geoportal http://inspire-geoportal.ec.europa.eu/
Pangaea www.pangaea.de
ArkeoGIS http://arkeogis.org/en/home
The Survey of Hillforts http://www.arch.ox.ac.uk/hillforts-atlas-survey.html

Artifacts
Portable Antiquities Scheme http://finds.org.uk/database

Radiocarbon
Vermeersch, P.M., 2016. Radiocarbon Palaeolithic Europe Database, Version 20. Available at http://ees.kuleuven.be/geography/projects/14c-palaeolithic
Canadian Archaeological Radiocarbon Database (CARD) http://www.canadianarchaeology.ca/
New Zealand Radiocarbon Database http://www.waikato.ac.nz/nzcd/intro.html
Rapa Nui Interactive Radiocarbon Database http://data.bishopmuseum.org/C14/
Utz Böhner and Daniel Schyle, radiocarbon CONTEXT database 2002-2006 http://context-database.uni-koeln.de/ [http://dx.doi.org/10.1594/GFZ.CONTEXT.Ed1]
RADON - Central European and Scandinavian database of 14C dates for the Neolithic and Early Bronze Age http://radon.ufg.uni-kiel.de
Andes 14C: Radiocarbon Database for Bolivia, Ecuador and Peru http://andes-c14.arqueologia.pl/database.html

Volunteer GIS
Web-GIS Map of Day of Archaeology Posts https://jessogden.carto.com/me
A #NoDAPLMap https://northlandia.wordpress.com
The Bakken Pipeline https://bakkenpipelinemap.com
The Decolonial Atlas https://decolonialatlas.wordpress.com
Pleiades – The Stoa Consortium https://pleiades.stoa.org
Micropasts http://crowdsourced.micropasts.org
GlobalXplorer http://globalxplorer.org

Spatial toolkit for visualization
Relief Visualization Toolbox (RVT) http://iaps.zrc-sazu.si/en/rvt#v

3D
CyArk http://www.cyark.org/
Sketchfab https://sketchfab.com/

Institutional Centers
Stanford Geospatial Center, Harvard Geospatial Library http://library.stanford.edu/research/stanford-geospatial-center
Ancient World Mapping Center (University of North Carolina) http://awmc.unc.edu/wordpress/
Center for Advanced Spatial Technologies (Arkansas) http://www.cast.uark.edu/
Institute of Anthropological and Spatial Studies, Slovenian Academy of Sciences and Arts http://iaps.zrc-sazu.si/en#v

Please cite this article in press as: McCoy, M.D., Geospatial Big Data and archaeology: Prospects and problems too great to ignore, Journal of Archaeological Science (2017), http://dx.doi.org/10.1016/j.jas.2017.06.003