Computers & Geosciences 75 (2015) 13–16
CCDST: A free Canadian climate data scraping tool
Charmaine Bonifacio a, Thomas E. Barchyn b, Chris H. Hugenholtz b,*, Stefan W. Kienzle a,c
a Department of Geography, University of Lethbridge, 4401 University Drive, Lethbridge, AB, Canada T1K 3M4
b Department of Geography, University of Calgary, 2500 University Drive NW, Calgary, AB, Canada T2N 1N4
c Applied Behavioural Ecology and Ecosystems Research Unit, University of South Africa, PO Box 392, Florida, Pretoria, South Africa
Article info
Article history: Received 11 December 2013; received in revised form 16 October 2014; accepted 18 October 2014; available online 29 October 2014
Keywords: Climate data online Scraping tool Canada

Abstract
In this paper we present a new software tool that automatically fetches, downloads and consolidates climate data from a Web database where the data are spread across multiple Web pages. The tool is called the Canadian Climate Data Scraping Tool (CCDST) and was developed to enhance access to, and simplify analysis of, climate data from Canada's National Climate Data and Information Archive (NCDIA). The CCDST deconstructs a URL for a particular climate station in the NCDIA and then iteratively modifies the date parameters to download large volumes of data, remove individual file headers, and merge the data files into one output file. This automated sequence enhances access to climate data by substantially reducing the time needed to manually download data from multiple Web pages. To illustrate, we present a case study of the temporal dynamics of blowing snow events in which the tool saved an estimated 3.1 weeks of manual downloading. Without the CCDST, the time involved in manually downloading climate data limits access and restrains researchers and students from exploring climate trends. The tool is coded as a Microsoft Excel macro and is available to researchers and students for free. The main concept and structure of the tool can be modified for other Web databases hosting geophysical data. © Elsevier Ltd. All rights reserved.
1. Introduction

Vast archives of climate data are publicly available through the Internet (e.g., Menne et al., 2012; Vincent et al., 2012); however, not all archives can be accessed efficiently. Often, considerable manual downloading is required, which delays analysis and adds considerable cost to projects (Thorne et al., 2011). Ideally, climate data should be easily accessible in a bulk format for rapid assessment and analysis. In response, significant progress has been made in collating and distributing climatic data through various Web portals. Efforts such as the Goddard Institute for Space Studies temperature record (Hansen et al., 2010), the Global Historical Climatology Network (Lawrimore et al., 2011; Menne et al., 2012), the Climatic Research Unit temperature database (Jones et al., 2012), and Berkeley Earth (Rohde et al., 2013) provide global records of climate variables. A number of Canada-specific products have also been developed, such as the Adjusted and Homogenized Canadian Climate Data (Vincent and Gullett, 1999; Wan et al., 2010; Vincent et al., 2012) and the spatially interpolated products detailed in McKenney et al. (2011) and Hutchinson et al. (2009).
* Corresponding author. E-mail address: [email protected] (C.H. Hugenholtz).
http://dx.doi.org/10.1016/j.cageo.2014.10.010
0098-3004/© Elsevier Ltd. All rights reserved.
However, for some applications direct access to raw data is preferable. First, direct access reveals all the variables and stations measured. For example, some weather station records contain notes from manual observations, which are invaluable for analyzing phenomena such as dust storms (e.g., Fox et al., 2012) and other weather conditions that cannot be recorded by instruments. In contrast, some portals only serve certain data fields. For example, the Berkeley Earth project focuses primarily on temperature (Rohde et al., 2013), which restricts the types of analyses possible. Design criteria in large assimilation projects also mean that some stations are omitted (Vincent et al., 2012). Second, raw data are often available at finer timescales. Other portals only serve data on monthly or daily timescales, which are less useful for fine-scale analyses such as the analysis of extreme storms, which depends on hourly data (e.g., Hugenholtz, 2013). Third, raw data are directly measured, which is important for detecting local effects that homogenization can mask (Vincent et al., 2012), although we note that homogenized data are important for some trend analyses (Rohde et al., 2013). Fourth, the data are usually up to date, limiting problems with delays until portals update their records.

In Canada, direct public access to government-collected historical climate data is only available online through the National Climate Data and Information Archive (hereafter NCDIA; http://climate.weatheroffice.gc.ca). To access data, users select the data interval (hourly, daily, or monthly), the date range, and the station name. The site returns a list of stations meeting the user-defined
criteria and links to the data. Once the page with the data report is chosen, the user is directed to a new page with a data table, download link, and some basic plotting capabilities. However, the site severely limits the quantity of data available for download on each search. For example, hourly data for all variables can only be downloaded as one file per month of a given station's archive. For the climate stations of major Canadian cities, there are 60 years of hourly data on record, meaning that 720 files (60 years × 12 months) need to be manually specified and downloaded. The manual effort to download and collate all of the files for more than one station can take several days or weeks and represents a considerable time investment, thus limiting or even preventing the use of NCDIA data for detailed analyses.

To help with this problem we developed the Canadian Climate Data Scraping Tool (CCDST) to automate the download and collation of Canadian climate data from the NCDIA. The CCDST operates as a Visual Basic for Applications (VBA) macro embedded within a Microsoft Excel worksheet. Users specify the date range, site, and data format. The tool automatically downloads individual data files, removes file headers, and merges files in a format ready for analysis. Here, we present an overview of the CCDST software and workflow, and include a short case study detailing its application to detect trends in blowing snow on the Canadian Prairies. The concept and structure of the tool are adaptable to other Web-based geophysical data archives where the data of interest are separated on multiple Web pages.
2. Tool structure and functioning

Web data scraping is an automated process of compiling data from multiple Web pages in a systematic way. The scraping software mimics the browsing interaction between Web servers and humans (Glez-Peña et al., 2013), but through a series of automated processes. Scrapers quickly access, extract, and consolidate data into a structured output. A detailed overview of Web data scraping is provided by Glez-Peña et al. (2013), but for context we present a brief overview of the main processes. First, the scraper accesses the Web site or Web page through the Hypertext Transfer Protocol (HTTP). It then searches the Web pages and downloads the data of interest, which may be encoded in Hypertext Markup Language (HTML) or available through links to data files. The number of Web pages accessed and searched depends on the parameters established in the scraper before it is activated. The final step is to transform the data and consolidate them into a structure suitable for analysis. In the end, the scraper converts fragments of data into one cohesive dataset.

The main considerations in developing the CCDST were to make the tool efficient and easy to use by a broad user group, from students to professionals and researchers. As such, the CCDST consists of a VBA macro that is operated within Microsoft Excel. Microsoft Excel is a widely used spreadsheet program commonly available in institutional and enterprise settings. Many students and professionals use Excel for basic manipulation, graphing, and statistical analyses of climate data. Other common interpreted programming environments require download and installation of specialized interpreters (e.g., Python, R, and Octave), or costly and less common licenses (e.g., Matlab, SPSS, and SAS).
By using Excel, we minimize the computing skill required to operate the program, include users in institutional settings who do not have install privileges on their computers (but have Excel), and avoid costly license fees for specialized programs. There have been previous attempts to create scrapers for the NCDIA in R and Python, but to our knowledge these scrapers no longer function (see ‘CHCN’ R package: http://cran.r-project.org/web/packages/CHCN/CHCN.pdf, and ‘Canadian Weather Stations Python Scraper’: https://classic.scraperwiki.com/scrapers/can-weather-stations/).

Fig. 1. Main components of a URL from a climate station (Calgary International Airport, WMO ID: 71877) accessed through the “Climate Data Online” section of the NCDIA Website.

The CCDST accepts the following inputs: the start and end dates of the climate time series, and a unique URL that specifies the site and the data format desired (site and data interval). The start and end dates are optional; by default, the CCDST downloads the entire record for a site. The tool outputs a collated data file for the entire duration in comma-delimited ASCII text format (CSV).

The workflow for downloading data is straightforward and incorporates all the basic concepts of Web scrapers (Yang et al., 2010; Glez-Peña et al., 2013). First, users visit the NCDIA Web site to obtain a current download URL for a sample data file (e.g., Fig. 1). This data file must correspond to a particular date for the desired site. However, the precise date does not matter because the URL contains information about the date range for the entire record (Fig. 1). Second, the URL is entered into the CCDST within Excel, and the date range is specified (or left blank). At this point the user clicks ‘Download data’ and waits. The CCDST parses and validates the user inputs and deconstructs the URL from the NCDIA, then automatically begins to download ‘raw’ data files from the NCDIA, storing the files locally. Finally, the CCDST automatically removes file headers, chronologically merges the ‘raw’ data files, and saves the collated data file. At this point the user is notified with a diagnostic summary of operations. Users operate the tool through a graphical user interface (GUI) that is designed to simplify tool operation and lower barriers for those unfamiliar with programming (Fig. 2).

The CCDST takes advantage of a simple URL structure for data access at the NCDIA. The URL consists of a series of arguments led by a ‘?’ symbol and delimited by ‘&’ symbols (Fig. 1). By separating the URL into its components, the CCDST can iteratively reassemble the URL and download consecutive data files. Each site within the NCDIA record has a unique numerical station identifier. We require users to visit the NCDIA Web site to obtain a sample URL so that the tool can still operate if the URL base changes (see Fig. 1), and also so that users can obtain data from stations added in the future. While convenient, a static database of NCDIA station identifiers would limit application of the CCDST to stations presently available.
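The URL deconstruction and the header-removal and merge steps described above can be illustrated with a minimal Python sketch. This is not the CCDST's VBA code. The query parameter names (`Year`, `Month`), the example.org address, and the assumption that every downloaded file starts with a fixed number of header lines are hypothetical stand-ins chosen for illustration, and no network access is performed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def monthly_urls(sample_url, start, end, year_key="Year", month_key="Month"):
    """Yield one URL per month between start and end, both (year, month), inclusive.

    year_key and month_key are hypothetical parameter names; the real names
    must be read from a sample URL obtained from the data archive.
    """
    parts = urlsplit(sample_url)
    # Split the query string ('?a=1&b=2') into an ordered dict of arguments.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    year, month = start
    while (year, month) <= end:
        params[year_key] = str(year)
        params[month_key] = str(month)
        # Reassemble the URL with the modified date arguments.
        yield urlunsplit(parts._replace(query=urlencode(params)))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def merge_reports(texts, header_lines):
    """Merge file contents in the order given, keeping only the first header.

    Assumes every file begins with exactly header_lines lines of header.
    """
    merged = []
    for i, text in enumerate(texts):
        lines = text.splitlines()
        merged.extend(lines if i == 0 else lines[header_lines:])
    return "\n".join(merged) + "\n"


if __name__ == "__main__":
    sample = "http://example.org/data?stationID=1234&Year=2000&Month=1&format=csv"
    for url in monthly_urls(sample, (1999, 11), (2000, 2)):
        print(url)
```

In a full scraper, each generated URL would be fetched over HTTP and the responses passed to `merge_reports` in chronological order. Starting from a user-supplied sample URL, rather than a hard-coded base address, mirrors the design choice described above.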
Fig. 2. CCDST Graphical User Interface showing main inputs.
While Excel is commonly used for climate data analyses, the CCDST outputs a comma-delimited text file, which is straightforward to read into other data analysis programs. We have intentionally kept the VBA code for the macro unlocked so that developers can modify the code for other Web databases that are structured in a way similar to the NCDIA. Thus, while the main focus of our effort is climate data, there are other archives that could be consolidated. For example, an archive of historical radar images is available from Environment Canada, which uses the same structure as the NCDIA. Consequently, a time series of radar images can easily be downloaded by modifying the CCDST VBA code.

It is important for users of the tool to understand that the data available from the NCDIA are reviewed prior to posting for quality control purposes, but the database might still contain errors caused by non-climatic variations, which affect the comparability of the data. A number of changes can create non-climatic variations in climate series: changes in instrumentation (such as different models of anemometers for measuring wind speed), relocation of the station, addition of buildings and/or changes to the surface near the station, etc. Some datasets require homogenization (the process of removing non-climatic changes) before they can be used to reliably assess spatial and/or temporal changes in the climate parameters. Data from the NCDIA are not homogenized and, as such, are potentially subject to the aforementioned errors. Environment Canada does provide homogenized data, but they are not currently offered through the NCDIA. A number of techniques are available to homogenize data (e.g., change point analysis), and we refer the reader to several sources for more information (see Peterson et al., 1998; Aguilar et al., 2003).
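To make the idea of change point analysis concrete, the following Python sketch locates a single shift in the mean of a series by least squares. It is only an illustration of the general concept. It is not a method used by the CCDST or by Environment Canada, the function name is hypothetical, and practical homogenization procedures add significance testing and comparison with reference series (see Peterson et al., 1998).

```python
def mean_shift_changepoint(values):
    """Return the index k (1 <= k < n) that best splits values into two segments.

    The best split minimizes the summed squared deviations of each segment
    from its own mean, i.e. a single mean-shift model fitted by least squares.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    best_k, best_sse = None, float("inf")
    for k in range(1, n):
        left, right = values[:k], values[k:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((v - mean_left) ** 2 for v in left)
               + sum((v - mean_right) ** 2 for v in right))
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k


if __name__ == "__main__":
    # Synthetic series with an artificial step after the tenth value,
    # e.g. as might follow an instrument change.
    series = [5.0] * 10 + [6.0] * 10
    print(mean_shift_changepoint(series))
```

The sketch always returns a candidate split, even for a series without any real shift, so a detected index should be treated as a hypothesis to check against station metadata rather than as evidence of an inhomogeneity.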
3. Case example: frequency of blowing snow on the Canadian Prairies

Blowing snow is common during winter months on the Canadian Prairies (Li and Pomeroy, 1997b). Blowing snow presents a considerable hazard for drivers; it has been implicated in numerous fatalities and results in frequent road closures, with concomitant economic losses (see Fig. 3; e.g., http://www.theglobeandmail.com/news/national/western-canadas-wild-winds-snap-power-lines-topple-trees/article16352215/, accessed: 11 April 2014). Blowing snow also results in horizontal advection and redistribution of surface water, in addition to producing considerable water flux into the atmosphere through sublimation during transport. Water (in the form of snowpack) is critically important for semi-arid portions of the prairies, and thus, the
hydrology of blowing snow has received considerable scientific attention (e.g., Pomeroy et al., 1993; Li and Pomeroy, 1997a; Essery et al., 1999; Walter et al., 2004). Unfortunately, there are currently no prairie-wide blowing snow monitoring systems in place. Developing comprehensive and empirically validated models to predict blowing snow occurrence would allow development of operational warning systems. Predictive warning systems could reduce hazard to drivers and (for example) improve freight logistics by providing future probabilities of road conditions, risk, and closures. Additionally, modeling blowing snow would allow improved understanding of the hydrology associated with the shallow and mobile snowpack on the Canadian Prairies.

Many climate stations on the Canadian Prairies record ‘blowing snow’ occurrences based on the notes from manual observations. These observations are recorded on site by trained personnel at hourly intervals with a standardized protocol (Environment Canada, 1977). These data can be accessed through the NCDIA database with the CCDST, providing a valuable glimpse into the frequency, trends, and meteorological conditions responsible for blowing snow.

Here we use the CCDST to visualize trends in blowing snow frequency from 1953 to 2012. We selected the climate archive of hourly data from five airports in Alberta, which contain variable lengths of manual observation data: Medicine Hat (1953–2012; 500,032 h), Lethbridge (1953–2012; 477,132 h), Calgary (1953–2012; 521,743 h), Red Deer (1953–2012; 529,513 h), and Edmonton (1961–2012; 449,252 h). At Medicine Hat and Lethbridge the continuity of weather observations changed in 2006 and 1994, respectively, after which observations were no longer performed 24 h per day.

Once all the parameters were specified in the CCDST, the download took approximately 10 min for each station, for a total of 50 min. Most of this time was spent waiting for data to download.
In contrast, without the CCDST, the data download would have taken considerably longer. Each monthly record of hourly data takes approximately 2 min to specify, download, clean, and collate. For 59 years of data and 5 sites, this is approximately 118 h (59 years × 12 months/year × 5 sites × 2 min per data file = 7080 min). Quantified with common Canadian working schedules, this would have taken 3.1 weeks (7.5 h/day, 5 days/week, completely dedicated to file downloading) to download and collate the data. Compared with the 50 min required by the tool, the CCDST resulted in time and cost savings of over 99% for this case example.

With these data we can analyze the temporal dynamics of blowing snow occurrences (similar to ‘blowing dust’ in Fox et al., 2012). Results (Fig. 4) show no consistent patterns in terms of the main peaks; however, there is a period from 1991 to 2010 in which four of the stations show relatively low blowing snow frequency. Edmonton is the only station with a noticeable declining trend over the period of record. Despite its shorter record, Lethbridge clearly stands out in terms of the number and frequency of blowing snow events, which is matched by our observations of frequent winter road closures and poor driving conditions in the vicinity. While we only present a brief overview of larger scale trends in blowing snow frequency as a demonstration, more detailed analysis of the temporal dynamics of blowing snow is a straightforward extension (e.g., Li and Pomeroy, 1997a).
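As an illustration of how a collated hourly file might be summarized for this kind of analysis, the Python sketch below counts the hours per year in which a manual weather note mentions blowing snow. The column names (`Date/Time`, `Weather`), the timestamp layout, and the matching phrase are assumptions made for the example and should be checked against the actual output file. Counting hours per year is one possible frequency measure and is not necessarily the one plotted in Fig. 4.

```python
import csv
import io
from collections import Counter


def blowing_snow_hours_per_year(csv_text, time_col="Date/Time",
                                weather_col="Weather", phrase="blowing snow"):
    """Count hourly records per year whose weather note contains the phrase.

    Assumes timestamps begin with a four-digit year (e.g. '1953-01-01 00:00');
    time_col and weather_col are hypothetical column names.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        note = (row.get(weather_col) or "").lower()
        if phrase in note:
            counts[int(row[time_col][:4])] += 1
    return dict(counts)


if __name__ == "__main__":
    # Small synthetic sample standing in for a collated hourly file.
    sample = (
        "Date/Time,Weather\n"
        "1953-01-01 00:00,Blowing Snow\n"
        "1953-01-01 01:00,Clear\n"
        '1954-02-03 05:00,"Snow,Blowing Snow"\n'
        "1954-02-03 06:00,Blowing Snow\n"
    )
    print(blowing_snow_hours_per_year(sample))
```

Because the number of hours observed per year differs between stations and changes when 24 h observations stop, a comparison across stations would normally divide these counts by the number of hours actually observed in each year.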
4. Conclusions
Fig. 3. Blowing snow is a major driving hazard on the Canadian Prairies (Photo credit: Tayler Hamilton).
The CCDST is a new tool that makes downloading of Canadian climate data from the NCDIA far more efficient than manual downloading, with time and labor cost savings of up to 99%. By making the tool free to students and researchers we hope to increase productivity by replacing manual procedures with a more automated process. We presented a short case study of blowing snow frequency on the Canadian Prairies, demonstrating important time savings. While the time savings of using the CCDST are important, we hope the most valuable result of the CCDST is new analyses of Canadian climate data that in the past were considered unfeasible or too time consuming.

Fig. 4. Blowing snow frequency at five climate stations in Alberta, Canada (Medicine Hat: 50.0°N, 110.7°W; Lethbridge: 49.7°N, 112.8°W; Calgary: 51.0°N, 114.1°W; Red Deer: 52.3°N, 113.8°W; and Edmonton: 53.5°N, 113.5°W).
Acknowledgments

The authors thank two anonymous reviewers and the Editor-in-Chief for helpful comments that improved the scope of this paper.
Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.cageo.2014.10.010.
References

Aguilar, E., Auer, I., Brunet, M., Peterson, T.C., Wieringa, J., 2003. Guidance on metadata and homogenization. World Climate Data and Monitoring Programme, Report no. 53, WMO-TD no. 1186. World Meteorological Organization, Geneva.
Environment Canada, 1977. MANOBS: Manual of Surface Weather Observations. Environment Canada, Meteorological Service of Canada, Ottawa.
Essery, R., Li, L., Pomeroy, J.W., 1999. A distributed model of blowing snow over complex terrain. Hydrol. Process. 13, 2423–2438.
Fox, T.A., Barchyn, T.E., Hugenholtz, C.H., 2012. Successes of soil conservation in the Canadian Prairies highlighted by a historical decline in blowing dust. Environ. Res. Lett. 7, 014008.
Glez-Peña, D., Lourenço, A., López-Fernández, H., Reboiro-Jato, M., Fdez-Riverola, F., 2013. Web scraping technologies in an API world. Brief. Bioinform. 15 (5), 788–797. http://dx.doi.org/10.1093/bib/bbt026.
Hansen, J., Ruedy, R., Sato, M., Lo, K., 2010. Global surface temperature change. Rev. Geophys. 48, RG4004.
Hugenholtz, C.H., 2013. Anatomy of the November 2011 windstorms in southern Alberta, Canada. Weather 68, 295–299.
Hutchinson, M.F., McKenney, D.W., Lawrence, K., Pedlar, J.H., Hopkinson, R.F., Milewska, E., Papadopol, P., 2009. Development and testing of Canada-wide interpolated spatial models of daily minimum–maximum temperature and precipitation for 1961–2003. J. Appl. Meteorol. Climatol. 48, 725–741.
Jones, P.D., Lister, D.H., Osborn, T.J., Harpham, C., Salmon, M., Morice, C.P., 2012. Hemispheric and large-scale land-surface air temperature variations: an extensive revision and an update to 2010. J. Geophys. Res.: Atmos. 117.
Lawrimore, J.H., Menne, M.J., Gleason, B.E., Williams, C.N., Wuertz, D.B., Vose, R.S., Rennie, J., 2011. An overview of the Global Historical Climatology Network monthly mean temperature data set, version 3. J. Geophys. Res. 116, D19121.
Li, L., Pomeroy, J.W., 1997a. Estimates of threshold wind speeds for snow transport using meteorological data. J. Appl. Meteorol. 36, 205–213.
Li, L., Pomeroy, J.W., 1997b. Probability of occurrence of blowing snow. J. Geophys. Res. 102, 21955–21964.
McKenney, D.W., Hutchinson, M.F., Papadopol, P., Lawrence, K., Pedlar, J., Campbell, K., et al., 2011. Customized spatial climate models for North America. Bull. Am. Meteorol. Soc. 92, 1611–1622.
Menne, M.J., Durre, I., Vose, R.S., Gleason, B.E., Houston, T.G., 2012. An overview of the Global Historical Climatology Network-Daily database. J. Atmos. Ocean. Technol. 29, 897–910.
Peterson, T.C., Easterling, D.R., Karl, T.R., Groisman, P., Nicholls, N., Plummer, N., Torok, S., Auer, I., Boehm, R., Gullett, D., Vincent, L., Heino, R., Tuomenvirta, H., Mestre, O., Szentimrey, T., Salinger, J., Førland, E.J., Hanssen-Bauer, I., Alexandersson, H., Jones, P., Parker, D., 1998. Homogeneity adjustments of in situ atmospheric climate data: a review. Int. J. Climatol. 18 (13), 1493–1517.
Pomeroy, J.W., Gray, D.M., Landine, P.G., 1993. The prairie blowing snow model: characteristics, validation, operation. J. Hydrol. 144, 165–192.
Rohde, R., Muller, R., Jacobsen, R., Muller, E., Perlmutter, S., Rosenfeld, A., Wurtele, J., Groom, D., Wickham, C., 2013. A new estimate of the average earth surface land temperature spanning 1753–2011. Geoinform. Geostat.: Overview 1, 2.
Thorne, P.W., Willett, K.M., Allan, R.J., Bojinski, S., Christy, J.R., Fox, N., et al., 2011. Guiding the creation of a comprehensive surface temperature resource for twenty-first-century climate science. Bull. Am. Meteorol. Soc. 92, ES40–ES47.
Vincent, L.A., Gullett, D.W., 1999. Canadian historical and homogeneous temperature datasets for climate change analyses. Int. J. Climatol. 19, 1375–1388.
Vincent, L.A., Wang, X.L., Milewska, E.J., Wan, H., Yang, F., Swail, V., 2012. A second generation of homogenized Canadian monthly surface air temperature for climate trend analysis. J. Geophys. Res.: Atmos. 117, D18110.
Walter, M.T., McCool, D.K., King, L.G., Molnau, M., Campbell, G.S., 2004. Simple snowdrift model for distributed hydrological modeling. J. Hydrol. Eng. 9, 280–287.
Wan, H., Wang, X.L., Swail, V.R., 2010. Homogenization and trend analysis of Canadian near-surface wind speeds. J. Clim. 23, 1209–1225.
Yang, Y., Wilson, L.T., Wang, J., 2010. Development of an automated climatic data scraping, filtering and display system. Comput. Electron. Agric. 71 (1), 77–87.