Environmental Modelling & Software 60 (2014) 241–249
A scientific data processing framework for time series NetCDF data

Krista Gaustad a,*, Tim Shippert a, Brian Ermold a, Sherman Beus a, Jeff Daily a, Atle Borsholm b, Kevin Fox a

a Pacific Northwest National Laboratory, 902 Battelle Boulevard, P.O. Box 999 MSIN K7-28, Richland, WA 99352, United States
b Exelis Visual Information Solutions, Inc., 4990 Pearl East Circle, Boulder, CO 80301, United States
Article history: Received 24 August 2013; received in revised form 2 June 2014; accepted 7 June 2014; available online.

Abstract
The Atmospheric Radiation Measurement (ARM) Data Integrator (ADI) is a framework designed to streamline the development of scientific algorithms that analyze, and models that use, time-series NetCDF data. ADI automates the process of retrieving and preparing data for analysis, provides a modular, flexible framework that simplifies software development, and supports a data integration workflow. Algorithm and model input data, preprocessing, and output data specifications are defined through a graphical interface. ADI includes a library of software modules to support the workflow, and a source code generator that produces C, IDL®, and Python™ templates to jump-start development. While developed for processing climate data, ADI can be applied to any time-series data. This paper discusses the ADI framework and how ADI's capabilities can decrease the time and cost of implementing scientific algorithms, allowing modelers and scientists to focus their efforts on their research rather than on preparing and packaging data.
Keywords: Atmospheric science; Time-series NetCDF; Scientific data analysis; Observation data; Scientific workflow; Data management
1. Introduction

Since 1992, the U.S. Department of Energy's Atmospheric Radiation Measurement (ARM) program (Stokes and Schwartz, 1994) has been collecting data from highly instrumented ground stations, processing them into Network Common Data Form (NetCDF) format, and distributing them. The instrumentation is positioned across the globe in both permanent and mobile facilities. The program maintains a production data processing center that ingests the data collected from its instruments and creates higher-quality, more scientifically relevant data products in support of its goal of using the program's data to evaluate and improve global climate models (GCMs). These higher-level data products, referred to as "Value Added Products" (VAPs), are created by applying increasingly advanced analysis techniques to existing data products. Examples include precipitable water vapor and liquid water path retrievals that improve the modeling of the diabatic feedback from clouds in GCMs by improving the understanding of the impact of clouds on the radiative flux (Turner et al., 2007), and a closure experiment designed to analyze and improve ARM's Line-by-Line Radiative Transfer Model (LBLRTM) and the spectral line parameters it uses (Turner
et al., 2004). The latter has the potential to significantly improve GCM performance by contributing to small improvements in the accuracy of the radiative transfer models used by GCMs (Ellingson and Wiscombe, 1996). ARM has also developed a Climate Modeling Best Estimate (CMBE) data set for use by global climate modelers (Xie et al., 2010). While these examples were implemented prior to ADI's release, they exemplify the intent and nature of the algorithms that are currently being, and will be, developed within the ADI framework. This paper describes how ADI simplifies the access to, manipulation of, and generation of time-series data products, and how ADI can be used to expedite the development and analysis of robust, flexible scientific algorithms and atmospheric process models.

As noted above, continued improvement of climate and atmospheric process models through the analysis of ARM's cloud and radiation observations requires the implementation of increasingly complex routines that examine larger and more diverse datasets. VAPs recently released and those currently under development typically access hundreds of variables from many data sources, each with its own distinct coordinate grid. To work efficiently with these heterogeneous input data sets, and to reduce the time spent managing complex input data, ARM developed the ADI framework and associated development environment. ADI automates, to the extent possible, the integration of diverse input data into common formats, streamlines the creation of standard, well-documented output data
products, and decreases the time needed to import a scientist's prototype algorithm into ARM's production data processing environment. It supports automation through a graphical interface that stores retrieval process definitions, allowing them to be shared with others. Based on porting existing algorithms to ADI, ARM has noted a decrease in the time needed to perform typical pre- and post-processing data preparation tasks from about two days down to a few hours for the simplest case of a single input and output data product. For more complex algorithms, the time needed to implement the data retrieval, integration, and creation of output data products decreased from several weeks to at most a few days. Perhaps more importantly, confidence in the quality and consistency of the preprocessing has increased because of the use of a standard set of functions and libraries.

While ADI was developed to support the incorporation of well-established algorithms into a production data processing center, its preprocessing capabilities can also benefit scientists and modelers implementing models or testing algorithms for non-processing-system applications. It does this by facilitating the comparison, parameterization, and evaluation of model data. Scientific research frequently makes use of measurement data to develop and evaluate theories and to illustrate key findings. Before scientists can perform analysis on data collected from instruments, or on products derived from instrument data, they must first not only obtain the input data but also frequently alter its format to meet their specific needs. The modeling community experiences similar difficulties in working with non-standard programming interfaces, resource-consuming input data preparation and preprocessing, and the integration of data in diverse file formats and coordinate grid shapes. The use of multi-component, self-describing frameworks with common interfaces has demonstrated advantages in terms of metadata, provenance, error handling, and reproducibility that improve the performance of, and users' interaction with, a workflow (Turuncoglu et al., 2013). ADI can be viewed as a type of integrated environmental modeling (IEM) framework in that it supports the implementation of models as individual components, moves information between process and model components, captures provenance, supports data transformations to prepare data for the model(s), integrates standards and error handling, supports multiple programming languages, and allows modelers to retain ownership of their process, allowing them to share and exchange models and processes (Whelan et al., 2014).

Recognizing that many scientists will not want to work within the ADI development environment, but may still want to make use of its automated data preparation and production capabilities, a command-line application of the ADI workflow, referred to as the Data Consolidator, has been implemented. Based on the information provided via the graphical interface, the Data Consolidator executes the data retrieval, preprocessing, and data production workflow, providing a dataset in the format and with the content needed for a user's application or model. Thus, ADI can decrease the time and expense associated with the non-scientific tasks necessary to perform scientific analysis, whether in a production data processing environment or to meet individual pre- and post-processing needs.
2. Comparison with existing tools

This section describes existing frameworks, data transformation models, and supporting tools that perform functions similar to those of the ADI framework and its libraries, summarizes the key characteristics and capabilities that were considered in ADI's design, and compares ADI with other platforms and tools in the context of these qualities. It also describes ADI's design decisions
in terms of alternative architectures, programming, and data manipulation tools.

The need for a data integration platform to access, integrate, analyze, and share data from diverse datasets is a problem not only for the atmospheric science community (Woolf et al., 2005), but also for many other scientific disciplines such as the biological sciences (Nelson et al., 2011) and hydrology (Ames et al., 2012). Across most disciplines, including the atmospheric sciences, the typical solution is a framework through which tools specific to each data processing step can be integrated and applied as needed. Many framework solutions propose designs but have not been implemented (Woolf et al., 2005), or have been prototyped but not widely used (Cheng et al., 2009). Frameworks designed specifically for data transformations often focus on meeting the general transformation needs of data archives and warehouses, such as providing transformations that support changes in data formats (Abiteboul et al., 1999), the adoption of newer systems that use different data models, or the cleaning and consolidation of data (Claypool and Rundensteiner, 1999). Fully operational atmospheric science frameworks have also been developed, some very similar to ADI. The High Spectral Resolution Lidar (Eloranta, 2005) data download functionality, for example, features a graphical interface (http://hsrl.ssec.wisc.edu/) that allows users to select transformation parameters to be applied to the lidar data prior to delivery; however, it only provides transformation functionality for data from a single instrument. Giovanni (Berrick et al., 2008) is a workflow that provides a set of data analysis recipes through which users can manipulate data, along with powerful tools for visualizing the data. It is also similar to ADI in that it uses a GUI (Graphical User Interface) to simplify its use, but it does not allow users to implement their own data recipes. The Earth System Grid (ESG) (Bernholdt et al., 2005) is a project intended to address distributed and heterogeneous climate dataset management, discovery, access, and analysis, with an emphasis on data delivery; its current focus is on climate simulation datasets, although it has expressed a long-term goal of supporting observation data. Like ARM, ESG expects the data it distributes to adhere to its own specified set of standards. While ESG intends to eventually support data concatenation and sub-setting, its primary mechanism for providing data operations is to supply users with tools developed by others. The Climate Data Analysis Tools (CDAT) is the ESG's data analysis engine (Williams et al., 2009). CDAT is described as a framework that consolidates access to a disparate set of software tools for the discovery, analysis, and intercomparison of coupled multi-model climate data. As a framework solution, CDAT includes low-level data toolkits such as the Climate Model Output Rewriter, or CMOR (Doutriaux and Taylor, 2011). CMOR is similar to ADI in that it is a software library used to produce CF (Climate and Forecast)-compliant NetCDF files (Davis et al., 2014), has a built-in checker to confirm adherence to CF standards, uses UDUNITS (Unidata, 2011), and allows users to provide data of any type, unit, and dimension order.
However, CMOR does not appear to have been developed as an all-purpose data writer: its design is geared toward preparing and managing Model Intercomparison Project (MIP) output (i.e., the climate community's standard model experiments), and it only automatically converts to the units and types expected by MIP models.

Scientific and gridded data transformation models are available in numerical programming environments and as libraries of command-line, file-I/O-based operators. Numerical programming environments that support data transformations include MATLAB® (http://www.mathworks.com), its GNU counterpart Octave
(Eaton et al., 1997), S Plus, and its GNU counterpart R (Team, 2005). Two widely used file-based operator libraries are the NetCDF Operators (NCO) (Zender, 2008) and the Climate Data Operators (CDO) (https://code.zmaw.de/projects/cdo). NCO is a set of low-level functions that supports the analysis of self-describing gridded geoscience data. Many of the NCO and CDO functions are similar to operations performed through the ADI GUI (such as reading, writing, interpolating, and averaging), but these operators, along with functionality not supported by ADI, could be accessed from within the ADI framework as a useful alternative and supplement to the methods ADI already supports. Several existing frameworks use NCO, such as the Script Workflow Analysis for Multi-Processing (SWAMP) system, a framework that uses shell-script interfaces to optimize data analysis by running at data centers, thus avoiding the overhead associated with moving large data sets around (Wang et al., 2009).

An evaluation of existing tools and approaches was conducted in the spring of 2009 to determine whether an existing system could be used or leveraged to meet the architectural, standardized software development environment, data retrieval, transformation, and data creation requirements that ARM had determined were necessary to achieve the desired savings in algorithm development time and cost. Not surprisingly, no single system was found that met the program's needs. Existing solutions tended to focus either on providing a flexible architecture to integrate workflow components and data analysis tools, or on providing low-level tools that access, manipulate, or create data. ADI falls between these two paradigms, in that it does not need most of the capabilities provided by the available architectures. The low-level data retrieval, manipulation, and transformation libraries best suited to ARM's requirements were designed either to work with NetCDF file inputs and outputs (which makes them available to diverse users and systems, but does not allow them to be invoked efficiently from within an algorithm) or to operate within environments well suited to algorithm design. The latter were not well suited for production data processing and were generalized to all gridded data (e.g., MATLAB and S Plus), requiring additional effort to apply them to time-series data. Scientific workflow and development applications, such as Eclipse and Kepler (https://kepler-project.org), appeared particularly well suited for the design and prototyping of the algorithms ARM required. Recognizing this, the initial ADI prototype was implemented as a plug-in for the Eclipse integrated development environment and extensible plug-in system (http://www.eclipse.org/). However, testing in the prototype stage revealed that users were not making use of the platform. The workflow components needed to automate pre- and post-data processing and to jump-start algorithm development are static and follow a known order; the added complexity and reduced efficiency of carrying unneeded advanced features, such as data visualization, outweighed the benefits gained. As a result, the Eclipse environment was discontinued, and scripts are now used to execute the standard workflow to minimize complexity and improve processing efficiency.
This does not preclude the workflow from being ported to a more flexible framework, and the scripts can be replaced by calling the ADI components from a system that supports user-defined workflows. ADI users can incorporate I/O-based operator libraries such as CDO and NCO, which provide mathematical, statistical, and data transformation capabilities, into their processing by using functions provided for creating intermediate output files within the workflow. These intermediate files can be used as input to the operator library functions and the results pulled back into the ADI process. Functionality provided by the supported languages is also readily available. Users who prefer working within programming environments such as MATLAB can use the Data
Consolidator application as a pre-processing tool to create input data products suitable for subsequent use within those environments.

ADI stores the data retrieved from input files in data structures accessible to its modules and to users via data access functions. All of the retrieved data is stored in memory. Large files can currently be handled by reading in only the variables needed and by limiting the number of records read in each pass through the ADI workflow via the appropriate processing parameters: a process interval parameter controls the amount of data pushed through the workflow, and a split interval sets the maximum size of the output data products. Data is processed through the workflow and incrementally written to the output file until the split interval is reached; after that point, subsequent data is written to a new file (see the conceptual sketch below). This method of handling large files is only helpful for processing that can be performed in sequential chunks. ADI has been implemented on the Olympus computer supported by the Pacific Northwest National Laboratory (PNNL) Institutional Computing (PIC) program (Pacific Northwest National Laboratory, 2014) for use with computationally intensive models and algorithms. In addition, the data assimilation and transformation methods of ADI could fit into a MapReduce framework to process scalable datasets, and in the future ARM plans to implement multiprocessor and distributed processing methods natively within ADI.

ADI's framework and its low-level data I/O, NetCDF access, and manipulation functions were implemented in C because of its processing efficiency and because many higher-level languages interact well with C. To support development in other languages, with the same look and feel developers expect in those languages, bindings to the C-level functions are needed in each supported language; the script that defines the workflow and the templates that jump-start algorithm development must also be implemented in each language. Python™ (Van Rossum and Drake, 2003) and IDL® (Interactive Data Language) (http://www.exelisvis.com/idl) were selected as the initial supported programming languages because of the wide use of IDL in the atmospheric science community and the extensive and continually growing set of scientific modules supported in Python. While ARM does not currently plan to support ADI in any other development languages, support could be extended to other scientific languages such as MATLAB and R. MATLAB was evaluated as a candidate development language, but the features that make it valuable to scientists in designing their algorithms make it significantly slower than its alternatives. The GNU Data Language (GDL) was evaluated for use by the ARM infrastructure in 2011 (Coulais et al., 2011) and was not considered stable and complete enough to meet the infrastructure's needs at that time. The program will revisit this decision and could extend ADI support to include GDL if such a review shows the language has evolved sufficiently to meet the program's needs.
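To make the chunked processing described above concrete, the following short Python sketch mimics the interaction of a process interval and a split interval over a hypothetical two-day run. It is purely illustrative: the function and parameter names are invented for this sketch and are not part of the ADI API, and the split interval is expressed here in time rather than as an output file size.

```python
# Conceptual sketch only: shows how a process interval moves data through a
# workflow in chunks while a split interval caps each output product. The names
# used here are hypothetical and are not part of the ADI API.
from datetime import datetime, timedelta

def process_intervals(begin, end, process_interval):
    """Yield (start, stop) pairs covering [begin, end) one process interval at a time."""
    start = begin
    while start < end:
        stop = min(start + process_interval, end)
        yield start, stop
        start = stop

def run(begin, end, process_interval, split_interval):
    output_files, current_file, file_start = [], [], begin
    for start, stop in process_intervals(begin, end, process_interval):
        # Placeholder for the retrieve -> merge -> transform -> create steps for this chunk.
        current_file.append(f"data[{start:%Y-%m-%d %H:%M} .. {stop:%Y-%m-%d %H:%M})")
        # Close the current output product once it spans the split interval.
        if stop - file_start >= split_interval:
            output_files.append(current_file)
            current_file, file_start = [], stop
    if current_file:
        output_files.append(current_file)
    return output_files

files = run(datetime(2014, 6, 1), datetime(2014, 6, 3),
            process_interval=timedelta(hours=6), split_interval=timedelta(days=1))
print(len(files), "output products")   # -> 2 output products, 4 chunks each
```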
Table 1 notes the key characteristics and capabilities that were considered in the design of ADI and evaluated in the other available scientific workflow architectures. The frameworks examined, and the references used to determine whether a capability was met, are: the Framework for Meteosat data processing, FMet (Cermak et al., 2008); the High Spectral Resolution Lidar, HSRL (Eloranta, 2005); Giovanni (Berrick et al., 2008); the Earth System Grid, ESG (Bernholdt et al., 2005); the Climate Data Analysis Tools, CDAT (Williams et al., 2009); the Script Workflow Analysis for Multi-Processing, SWAMP (Wang et al., 2009); and the Real-time Environment for Analytical Processing, REAP (Barseghian et al., 2010).
Table 1
Comparison of ADI to alternative framework architectures.

Capability                                  FMet     HSRL         Giovanni  ESG           CDAT                           SWAMP        REAP  ADI
Software freely available                   Y        web app      web app   web app       Y                              Y            N     Y
Runs on non-proprietary operating system    Y        N/A          N/A       N/A           Y                              Y            Y     Y
User-defined workflows                      Y        N            N         Y (via CDAT)  Y                              Y            Y     –
User-defined transformations                N        Y (limited)  N         Y (via CDAT)  Y                              Y (via NCO)  N     Y
Can run user algorithms                     Y        N            N         Y (via CDAT)  Y                              Y            Y     Y
Data visualization tools                    Y        Y            Y         Y (via CDAT)  Y                              N            N     N
Command line interface                      N        N/A          N         Y (via CDAT)  Y                              Y            N     Y
Configurable through GUI                    Y        Y            Y         via CDAT      Y                              N            N     Y
Languages supported                         Fortran  N/A          N/A       Y (via CDAT)  Python, Java, C/C++, Fortran   NCO          N/A   C, IDL, Python
Facilitates algorithm development           N        N/A          N         Y (via CDAT)  Y                              N            N     Y
Processes large files                       ?        N/A          N/A       Y (via CDAT)  Y                              Y (via NCO)  ?     Y (limited)
Table 2 compares ADI's lower-level data manipulation capabilities with those available in other tools. While the capabilities appear very similar, analytical systems such as R and MATLAB, and the libraries associated with programming languages, provide limited integrated support for accessing, analyzing, and creating scientific datasets; NCO, CDO, CMOR, and ADI provide higher-level functions that users would otherwise have to write themselves using the tools associated with a particular programming language.

3. Materials and methods

The following sections discuss the composition of ARM data and the software libraries, software packages, and databases used to support ADI.
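As a point of reference for the libraries described in the following subsections, the short Python sketch below shows the kind of manual NetCDF access, time decoding, and unit handling that ADI's retrieval layer performs automatically using these packages. The file name, variable names, and units are invented for illustration and do not correspond to an actual ARM datastream.

```python
# Hypothetical example of manually reading a time-series NetCDF file; ADI's
# retrieval module performs these steps (including unit conversion via UDUNITS)
# automatically, based on the retrieval definition stored in the PCM.
import numpy as np
import netCDF4

with netCDF4.Dataset("example_met_data.nc") as ds:          # invented file name
    time_var = ds.variables["time"]
    times = netCDF4.num2date(time_var[:], time_var.units)   # decode the time axis
    temp_var = ds.variables["temperature"]                  # invented variable name
    temp = np.asarray(temp_var[:], dtype=float)
    if getattr(temp_var, "units", "") == "degC":            # convert units by hand;
        temp = temp + 273.15                                 # ADI delegates this to UDUNITS
print(times[0], float(temp.mean()))
```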
3.1. ARM data

This initial implementation of ADI has been developed for use with NetCDF (Rew and Davis, 1990) data produced by the ARM program. ARM data is publicly available and can be accessed through the ARM Data Archive (http://www.archive.arm.gov); registration is required only to allow the program to keep metrics on the data it delivers. However, ADI can be used to process any time-series NetCDF data that conforms to NetCDF standards. ARM instrument data is collected, converted from the instrument's native or raw data format to NetCDF, and then delivered to the ARM Data Archive for permanent storage. ARM creates higher-level data products by applying scientific algorithms to the ingested instrument data.

3.2. Unidata NetCDF and UDUNITS software packages

ADI's convenience functions use Unidata's NetCDF4 software libraries (Davis et al., 2014) to access input files and create output files in the NetCDF format, and the UDUNITS2 package (Unidata, 2011) to convert units.

3.3. PostgreSQL database

ADI's data system database (DSDB) is implemented using the PostgreSQL 8.4 open-source object-relational database system (PostgreSQL, 2014). ADI can be configured to build without it if only the Web Service backend is needed.

3.4. LIBCURL

The multiprotocol file transfer library libcurl (http://curl.haxx.se/libcurl/), version 7.19.7, is used to access the DSDB via a Web Service. ADI can be configured to build without it if only the PostgreSQL backend is needed.

3.5. SASL

The Simple Authentication and Security Layer (SASL) library, version 2.1.23, is used for the MD5 functions. ADI will be updated to use OpenSSL in the future.

3.6. Apache Flex

The browser-based Process Configuration Manager (PCM) graphical interface was developed using the Apache Flex® (Apache Software Foundation, 2014) software development kit, but only the Adobe Flash Player is needed to run it.

3.7. Cheetah

To jump-start user algorithm development, the ADI template source code generator uses the open source template engine and code generation tool Cheetah (Rudd et al., 2001) to generate the C, IDL, and Python data integration algorithms with user hooks.

3.8. System architecture

The current ADI libraries and applications have been compiled and tested for the Red Hat Enterprise Linux 2.5 and 2.6 operating systems, but they should be able to run under any non-proprietary operating system (e.g., GNU/Linux, BSD) that supports the previously discussed libraries and packages. The beta version of the libraries was developed and tested for the Fedora 13 operating system.

3.9. IDL

To develop algorithms in IDL, version 8.2 is needed for 64-bit support.

3.10. Logs and metrics-related components

ADI has a built-in logging and messaging system. Most functions that ADI performs automatically (i.e., all the green boxes (in the web version) in Fig. 1) are logged, and unusual circumstances such as missing files or system issues are identified. ADI also provides standard tools for developers to log and classify their messages as regular logs, warnings, or errors. In this way every ADI process has a standardized log, which allows for the development of automated methods for tracking provenance, benchmarking, or data mining to track when and how often specific events occur.

Table 2
Comparison of ADI to alternative data processing libraries and tools.

Capability                                     IDL  Python  MATLAB, Octave  S Plus, R  CMOR  NCO  CDO  ADI
Command line interface                         Y    Y       Y               Y          N     Y    Y    N
Scriptable                                     Y    Y       Y               Y          Y     Y    Y    Y
Reads NetCDF                                   Y    Y       Y               Y          N     Y    Y    Y
Writes NetCDF                                  Y    Y       Y               Y          Y     Y    Y    Y
Transformation functions                       N    N       N               N          N     Y    Y    Y
Integrates data through GUI                    N    N       N               N          N     N    N    Y
Provides mathematical, statistical functions   Y    Y       Y               Y          N     Y    Y    N

4. Results
The Atmospheric Radiation Measurement (ARM) Data Integrator (ADI) is a suite of tools, libraries, data structures, and interfaces that simplify the development of scientific algorithms that process time-series data. It minimizes the amount of source code needed to implement an algorithm by automating data retrieval and the creation of output data products, allowing development efforts to be focused on the science. Built-in integration capabilities include merging data, applying data type and unit conversions, and coordinate dimension transformations. ADI's web-based graphical interface, the Process Configuration Manager (PCM), is used to
define the parameters that describe the data to be retrieved, how it should be integrated, and the content of the output data products. This information is stored in a central database and made available through web services to ADI's processing framework and other supporting applications. Fig. 1 shows the relationship of the graphical interface to the database, the ADI framework, and the supporting applications.

Fig. 1. ARM ADI framework, Data System DataBase (DSDB), and related components.

Once a process is defined in the PCM, users can develop their own source code to run under the ADI framework. However, if a user simply needs to consolidate data from existing NetCDF files or transform data onto a new coordinate grid without any additional analysis, the Data Consolidator application can be used without the need to write any code. The core modules executed by the workflow are shown as the green (in the web version) boxes in Fig. 2. A source code generation tool is provided to jump-start user algorithm development. It uses PCM retrieval process definitions to create an initial set of C, Python, or IDL software project files that execute the core ADI processing modules and provide hooks into which users can insert
custom logic. The hooks, designed to allow users to access any point in the framework, are represented in Fig. 2 as blue circles.

Fig. 2. ARM ADI workflow and framework hooks.

The remainder of this section describes the components of the ADI framework in more detail.

4.1. Configuring a process

The Process Configuration Manager (PCM) is the graphical user interface through which processes and datastreams are defined, viewed, and edited. This interface simplifies the development of new ARM algorithms and data products by providing access to existing ARM datastreams and variables for use in defining the input and output of a new process. Defining a process includes specifying the input data that needs to be retrieved, the output datastream(s) that will be created (Figure S1), the mapping of input data that is passed through to the output, and process specifications such as the size of the data chunks sent through the data integration workflow and the size of the data products produced. ADI can also transform all input data onto a
common grid, as well as automatically convert data types and units. The process definition interfaces and their interrelationships are illustrated in the online Supplementary materials (Figure S2). Each box represents a window that can be accessed through the PCM's graphical interface and includes the name of the form displayed in the window and the process attributes collected from that form. The process attributes that are defined based on an input or output datastream are noted with blue dashed lines.

4.1.1. Specifying a retrieval

The PCM's Retrieval Editor form (Fig. 3) decreases the amount of code VAP developers have to write by allowing users to set criteria that the ADI workflow uses to automate the opening of source data files, the retrieval of variables, and the execution of coordinate system, unit, and data type transformations. Users can set up a hierarchical list of input data sources (Figure S3), transform the retrieved data onto a new coordinate grid, and specify the variables to be passed through to output data products (Figure S4). If a supported transformation method is appropriate (interpolation, averaging, or nearest neighbor), the user does not need to supply any code but must define the desired grid. If the grid is uniform (i.e., the spacing between coordinate values is constant) or is the same coordinate grid as that of a retrieved variable, the grid can be defined through the interface (Figure S5). To transform to an irregular grid, the user explicitly defines the elements of the grid through a function or via a flat file; no additional code is required. Selection of a preferred and an alternative data source can be based on time (where the preferred input changes as improved data products are brought on line) and location (typically driven by different input sources being available at different locations). Examples of how data can be manipulated include variable name changes and transformations. A detailed list of the retrieval criteria that can be set is available in the online Supplementary materials (Table S3).

Fig. 3. Retrieval Editor Form main form view of process adi_example1. Variables are retrieved from two datastreams, aosmet.a1 and aoscpcf.a1, both of which are publicly available from the ARM Data Archive.

4.1.2. Datastream definition

In addition to fully defining a process, the other functionality provided by the PCM is to simplify the design and creation
of output data products. New datastreams can be defined manually by entering attribute information or by dragging and dropping attributes from output data products already defined in the PCM (Figure S6). Importing header information from an existing NetCDF file can also be used to create a new datastream. ARM, like most scientific data repositories, requires that submitted data conform to standards defined by the program. To facilitate user adherence to these standards, and the program's ability to confirm that adherence, the standards validation logic is embedded in the PCM interface. If a user's output file structure diverges from the expected standards, the violations are flagged (Figure S7).

4.2. Data Consolidator tool

The Data Consolidator is a stand-alone command-line application that consolidates data from different input sources based on a defined retrieval process and produces the output data product(s) associated with that process without requiring the user to write any source code. If variables have been defined in an output data product that have not been associated with a variable retrieved or mapped from an input data source, the Data Consolidator creates these 'new' variables using the metadata defined for the output data product and populates them with default fill values. This feature allows users who do not want to develop software within the ADI framework to use its data integration capabilities to prepare data files that can be used as inputs to their algorithm or programming environment. Their algorithm simply needs to assign real values to the fill values and does not have to perform transformations or create variables or attributes.

4.3. Algorithm development framework and template generation tool

A primary design goal of ADI is to expedite the development of ARM algorithms that add value through the additional processing of existing ARM datastreams and to provide atmospheric scientists
a development environment that will expedite scientific discovery. Algorithm development is supported for three programming languages: C, Python, and IDL. The steps to develop an ADI algorithm are:

1. Define a process in the PCM, documenting the process inputs, transformations, and outputs.
2. Run the Template Generator application to create an initial set of project files.
3. Implement logic specific to the algorithm through the appropriate user hook(s).
4. Compile (if using the C language) and run the project.
5. Validate the output, returning to earlier steps as necessary.

Following creation of a retrieval process, the source code Template Generator application is available to create an initial project in the desired programming language. The templates produced include a main routine; macros for the input, transformation, and output parameters; and function definitions for frequently used hooks. The project, with no additional logic added, can be compiled and run at the command line to produce output identical to that of the Data Consolidator run for the same retrieval process. A user's code is added using a two-step process: (1) update the main routine to call the necessary hook functions, and (2) add the desired logic to the body of the hook functions. A hook is available between each of ADI's main processing modules. The main routine of an ADI algorithm specifies which PCM process(es) can be executed and which user hooks will be invoked during their processing. Example main routines implemented in C and IDL are illustrated in the online Supplementary materials (Figures S8 and S9, respectively). The core modules, user hooks, and supporting functions are described in Section 4.4, ADI processing framework.

4.4. ADI processing framework

The ADI processing framework is composed of core modules, user hooks, and internal data structures. Data is passed between the core modules and hooks via internal data structures, which store retrieved data, transformed data, and output data. A user data structure, whose content and structure are defined by the user, gives developers a mechanism for sharing information across the core modules and user hooks. The flow of data through the core modules and user hooks is illustrated in Fig. 2. The core modules (shown as green boxes in Fig. 2) execute the actions necessary to consolidate a group of diverse datasets into a single output data product. The user hooks (shown as blue circles) are the functions into which users can insert their own code to perform scientific analysis and any pre- and post-processing not supported by the core modules. The processing interval is the amount of data pushed through the pipeline at one time; the retrieve, merge, transform, data creation, and storage core modules are invoked once for each process interval. The output interval is the maximum amount of data stored in the files generated by the process. The initialize and finish core modules are executed only once. The following sections discuss the core modules, user hooks, and associated data structures that make up the ADI data pipeline.

4.4.1. Core modules

The Initialize module is invoked first and sets the stage for the data processing. Tasks performed by the initialization module include capturing command line arguments, pulling process configuration information from the database, opening logs, initializing input and output datastreams, and building the internal
data retrieval structure. After the process has been initialized, the execution of the modules that are repeated for each processing interval begins. Each of these modules requires several input arguments, including the start and end dates of the process interval, the most relevant internal data structure, and the user's data structure. The Retrieve Data core module populates the retrieved data structure, transforming the units and data types of the input data to those specified in the graphical interface. If more than one input data file exists within a processing interval, the data is stored in the retrieved data structure in individual objects for each input data file. Unless the user indicates otherwise via a flag, the Merge Observation module consolidates multiple individual observations into a single object that spans the entire processing interval. The Transform Data module then maps the retrieved data to the coordinate system grid specified in the Retrieval Editor PCM form, storing the results in the transformed data structure. How the transformation is executed is a function of the transformation parameters that have been set for that coordinate system. In addition to the transformation parameters defined in the graphical interface, additional parameters can be defined either in configuration files or through supporting functions. A detailed description of the functionality and capabilities of the Transform Data module is beyond the scope of this paper; the parameters are presented in ADI's online documentation at https://engineering.arm.gov/ADI_doc/ (Gaustad et al., 2014a). Next, the Create Output Datasets module creates an output data structure for each output data product and populates the output variables that were mapped from retrieved variables in the Retrieval Editor form. Output variables not mapped to a retrieved variable are created and assigned fill values. The last core module invoked for each process interval, the Store Data module, creates the output NetCDF file(s) for the current processing interval. Process control is then returned to the pre-retrieval hook for the next iteration.
Table 3
Available hooks that developers can access from their algorithms.

Initialize: Provides space for the user to perform initialization not supported by the ADI initialization module. Instantiates the user data structure, which is subsequently passed to all downstream hooks and modules.

Pre-retrieval: First user function executed for each pass over the individual process intervals that span the processing period; executes before data is retrieved from the input data sources.

Post-retrieval: Allows users to perform actions on the retrieved data before the individual observations that comprise the current processing interval are merged.

Pre-transformation: Allows users to access data after it has been merged (making it easier to traverse) but before any transformations are applied.

Post-transformation: Gives users access to the data after the transformations have been applied.

Process data: Frequently used to implement the science algorithm, because the pre-processing of the input data has completed.

Finish: Executed after all intervals falling between the begin and end processing dates specified in the command line arguments have been completed, but prior to the execution of the Finish core module. Used to free memory allocated for the user data structure or to perform whatever other cleanup tasks are needed.
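To illustrate how these hooks slot between the core modules, the following minimal Python sketch pushes one process interval of wind observations through a simplified retrieve-transform-store sequence, using the wind speed and direction example discussed below in Section 4.4.2 as the science content of the pre- and post-transformation hooks. The function names, dictionary-based data structures, and the block-averaging stand-in for the Transform Data module are hypothetical and do not correspond to ADI's actual C, IDL, or Python bindings.

```python
# Conceptual sketch of one ADI-style process interval with user hooks; the names
# and data structures are hypothetical and do not reflect the real ADI bindings.
import numpy as np

def pre_transformation_hook(data):
    # Convert wind speed/direction to u/v components so that averaging is meaningful.
    spd, wdir = data.pop("wspd"), np.deg2rad(data.pop("wdir"))
    data["u"] = -spd * np.sin(wdir)   # meteorological convention: direction is the
    data["v"] = -spd * np.cos(wdir)   # direction the wind blows from
    return data

def post_transformation_hook(data):
    # Convert the averaged u/v components back to wind speed and direction.
    u, v = data.pop("u"), data.pop("v")
    data["wspd"] = np.hypot(u, v)
    data["wdir"] = np.rad2deg(np.arctan2(-u, -v)) % 360.0
    return data

def transform_data(data, block=3):
    # Stand-in for the Transform Data module: block-average each variable onto a
    # coarser time grid (e.g., 20-s observations averaged toward a 1-min grid).
    out = {}
    for name, values in data.items():
        trimmed = values[: (len(values) // block) * block]
        out[name] = trimmed.reshape(-1, block).mean(axis=1)
    return out

def run_process_interval(retrieved):
    # Core-module order for one interval: the retrieved (and merged) data passes
    # through the pre-transformation hook, the transformation, and the
    # post-transformation hook before output datasets would be created and stored.
    data = pre_transformation_hook(retrieved)
    data = transform_data(data)
    return post_transformation_hook(data)

if __name__ == "__main__":
    # One hypothetical process interval of 20-s wind observations.
    retrieved = {
        "wspd": np.array([3.0, 4.0, 5.0, 4.0, 3.5, 3.0]),
        "wdir": np.array([350.0, 10.0, 20.0, 340.0, 355.0, 5.0]),
    }
    print(run_process_interval(retrieved))
```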
Once the Retrieve, Merge, Transform, Create, and Store output modules have been executed for each of the processing-interval-sized chunks of data that span the requested processing period, the Finish core module updates the database and logs with the process status and metrics, closes the logs, sends any email messages generated, and cleans up all allocated memory. Users can access, set, and change variable values in all of the internal data structures when and how they want via the provided user hooks.

4.4.2. User hooks

To allow users the opportunity to implement any functionality needed, hooks are available before and after each of the core modules. The available hooks are summarized in Table 3. Hooks that execute prior to the Process Data module allow developers to convert data into whatever form is most appropriate for the desired transformations. For example, to find an average wind speed, it is necessary to average the orthogonal u and v wind components and convert the final result back to wind speed and direction. A user would accomplish this by using the Pre-transformation hook to go from speed and direction to u and v, and the Post-transformation hook to convert the results back to speed and direction. Plots of wind speed and direction on a data source's original 20-s time-coordinate grid and a transformed one-minute grid, and a link to the source code used to execute the transformation, are available in the online Supplementary materials (Figures S10 and S11).

4.4.3. Supporting functions

The purpose of the supporting functions is to provide developers access to any information relating to the process and its input and output datasets that will help them perform their analysis. The functions are written in C, but bindings to IDL and Python have been created. When the IDL or Python bindings are used, the functions are presented as objects with associated properties and routines to provide users with the programming environment they expect in those languages. Descriptions of the available functions and objects are provided in the online documentation (https://engineering.arm.gov/ADI_doc/library.html).

5. Conclusions

The ARM Data Integrator (ADI) framework automates data retrieval, preparation, and creation, and provides a structured algorithm development environment that expedites the development of algorithms and improves their standardization. The Data Consolidator application allows users to execute the retrieval, transformation, and data creation specifications defined through the graphical interface in a single command. Non-ADI algorithms can use the Data Consolidator to create input files with the desired coordinate grid, units, variable names, metadata, and placeholder values for variables whose values will be calculated by the algorithm. This removes all data preparation and creation logic from those algorithms, allowing them to read only the variables they need to perform their calculations and then update the existing fill values with the results. The Data Consolidator application can also be used to create simplified inputs for more sophisticated programming environments such as MATLAB and R. It can create inputs to the statistical, mathematical, and scientific functionality supported by these and other programming languages and by libraries of command-line operators such as the Climate Data Operators (CDO) and NetCDF Operators (NCO).
Together the ADI development environment and Data Consolidator provide a niche solution well suited for implementing robust, production-ready software. Their modular design, implementation in C, and bindings to Python allow them to
use and be used by tools with more architectural flexibility and scientific libraries. The Atmospheric Radiation Measurement (ARM) program has reaped significant savings by decreasing the time needed to pre-process the various inputs into the format and coordinate system needed for its scientific algorithms. The effort required to create a complex algorithm with inputs from several datastreams on different grids has been reduced from weeks of development time to a few days. While not as easily quantified, an equally important benefit is the move away from individual developers implementing their own methods of conditioning input data to a set of versatile, robust, and well-tested library routines. This transition has resulted in substantially fewer problems being found in ARM's evaluation data sets and decreased the time needed to integrate new algorithms into the ARM production data processing system.

6. Software availability

The PCM demonstration software can be accessed free of charge at https://engineering.arm.gov/pcm/Main.html (Gaustad et al., 2014b). This link is to the interface currently used by the ARM infrastructure to manage ongoing processing; without signing in, the read-only view serves to demonstrate the current capabilities and to display live data product and processing configurations. Instructions for downloading a RHEL6 build and the source code, which are available under a modified BSD license, can be found at https://github.com/ARM-DOE/ADI.

Acknowledgments

This research was supported by the Office of Biological and Environmental Research of the U.S. Department of Energy under Contract No. DE-AC05-76RL01830 as part of the Atmospheric Radiation Measurement Climate Research Facility. This project took advantage of netCDF software developed by UCAR/Unidata (www.unidata.ucar.edu/software/netcdf/).

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.envsoft.2014.06.005.

References

Abiteboul, S., Cluet, S., Milo, T., Mogilevsky, P., Siméon, J., Zohar, S., 1999. Tools for data translation and integration. Bull. Tech. Comm. Data Eng. 22 (1), 3–8.
Ames, D.P., Horsburgh, J.S., Cao, Y., Kadlec, J., Whiteaker, T., Valentine, D., 2012. HydroDesktop: web services-based software for hydrologic data discovery, download, visualization, and analysis. Environ. Model. Softw. 37, 146–156.
Apache Software Foundation, 2014. Apache Flex. The Apache Software Foundation (accessed 22.04.14). http://flex.apache.org.
Barseghian, D., Altintas, I., Jones, M., Crawl, D., Potter, N., Gallagher, J., Cornillon, P., Schildhauer, M., Borer, E., Seabloom, E., Hosseini, P., 2010. Workflows and extensions to the Kepler scientific workflow system to support environmental sensor data access and analysis. Ecol. Inform. 5 (1), 42–50.
Bernholdt, D., Bharathi, S., Brown, D., Chanchio, K., Chen, M., Chervenak, A., Cinquini, L., Drach, B., Foster, I., Fox, P., Garcia, J., Kesselman, C., Markel, R., Middleton, D., Nefedova, V., Pouchard, L., Shoshani, A., Sim, A., Strand, G., Williams, D., 2005. The Earth System Grid: supporting the next generation of climate modeling research. Proc. IEEE 93 (3), 485–495.
Berrick, S.W., Leptoukh, G., Farley, J.D., Rui, H., 2008. Giovanni: a web service workflow-based data visualization and analysis system. IEEE Trans. Geosci. Remote Sens. 47 (1), 106–113.
Cermak, J., Bendix, J., Dobbermann, M., 2008. FMet - an integrated framework for Meteosat data processing for operational scientific applications. Comput.
Geosci. 34 (11), 1638–1644.
Cheng, J., Lin, X., Zhou, Y., Li, J., 2009. A web based workflow system for distributed atmospheric data processing. In: 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications. IEEE Computer Society. http://dx.doi.org/10.1109/ISPA.2009.30.
Claypool, K., Rundensteiner, E., 1999. Flexible database transformations: the SERF approach. Bull. Tech. Comm. Data Eng. 22 (1), 19–24.
Coulais, A., Schellens, M., Gales, J., Arabas, S., Boquien, M., Chanial, P., Messmer, P., Fillmore, D., Poplawski, O., Maret, S., Marchal, G., Galmiche, N., Mermet, T., 2011. Status of GDL - GNU Data Language. arXiv preprint arXiv:1101.0679.
Davis, G., Rew, R., Hartnett, E., Caron, J., Heimbigner, D., Emmerson, S., Davies, H., Fisher, W., 2014. NetCDF. Unidata Program Center, Boulder, Colorado (accessed 22.04.14). http://www.unidata.ucar.edu/software/netcdf/.
Doutriaux, C., Taylor, K., 2011. Climate Model Output Rewriter (CMOR). Program for Climate Model Diagnosis and Intercomparison (PCMDI) (accessed 22.04.14). http://www2-pcmdi.llnl.gov/cmor/index1_html.
Eaton, J.W., Bateman, D., Hauberg, S., 1997. GNU Octave. Free Software Foundation.
Ellingson, R.G., Wiscombe, W.J., 1996. The Spectral Radiance Experiment (SPECTRE): project description and sample results. Bull. Am. Meteorol. Soc. 77 (9), 1967–1985.
Eloranta, E.E., 2005. High Spectral Resolution Lidar. Springer, New York, pp. 143–163.
Gaustad, K.G., Shippert, T., Ermold, B., Beus, S., Daily, J., 2014a. ARM Data Integrator (ADI) Documentation. Atmospheric Radiation Measurement (ARM) Climate Research Facility (accessed 22.04.14). https://engineering.arm.gov/ADI_doc/.
Gaustad, K.G., Shippert, T., Ermold, B., Beus, S., Daily, J., Borsholm, A., Fox, K., 2014b. ARM Processing Configuration Manager (PCM) Application. Atmospheric Radiation Measurement (ARM) Climate Research Facility (accessed 22.04.14). https://engineering.arm.gov/pcm/Main.html.
Nelson, E.K., Piehler, B., Eckels, J., Rauch, A., Bellew, M., Hussey, P., Ramsay, S., Nathe, C., Lum, K., Krouse, K., Steams, D., Connolly, B., Skillman, T., Igra, M., 2011. LabKey Server: an open source platform for scientific data integration, analysis, and collaboration. BMC Bioinform. 12 (1), 71–93.
Pacific Northwest National Laboratory, 2014. Institutional Computing Program (accessed 22.04.14). http://pic.pnnl.gov/.
PostgreSQL, 2014. PostgreSQL. The PostgreSQL Global Development Group (accessed 22.04.14). http://www.postgresql.org/.
Rew, R.K., Davis, G.P., 1990. NetCDF: an interface for scientific data access. IEEE Comput. Graph. Appl. 10 (4), 76–82.
Rudd, T., Orr, M., Bicking, I., Esterbrook, C., 2001. Cheetah, the Python-Powered Template Engine. The Cheetah Development Team (accessed 22.04.14). http://www.cheetahtemplate.org.
Stokes, G.M., Schwartz, S.E., 1994. The Atmospheric Radiation Measurement (ARM) program: programmatic background and design of the cloud and radiation test bed. Bull. Am. Meteorol. Soc. 75 (7), 1201–1221.
Team, R.C., 2005. R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
Turner, D.D., Tobin, D.C., Clough, S.A., Brown, P.D., Ellingson, R.G., Mlawer, E.J., Knuteson, R.O., Revercomb, H.E., Shippert, T.R., Smith, W.L., Shephard, M.W., 2004. The QME AERI LBLRTM: a closure experiment for downwelling high spectral resolution infrared radiance. J. Atmos. Sci. 61 (22).
Turner, D.D., Clough, S.A., Liljegren, J.C., Clothiaux, E.E., Cady-Pereira, K.E., Gaustad, K.L., 2007. Retrieving liquid water path and precipitable water vapor from the Atmospheric Radiation Measurement (ARM) microwave radiometers. IEEE Trans. Geosci. Remote Sens. 45 (11), 3680–3690.
Turuncoglu, U.U., Dalfes, N., Murphy, S., DeLuca, C., 2013. Toward self-describing and workflow integrated earth system models: a coupled atmosphere-ocean modeling system application. Environ. Model. Softw. 39, 247–262.
Unidata, 2011. UDUNITS. University Corporation for Atmospheric Research (UCAR), Boulder, Colorado (accessed 22.04.14). http://www.unidata.ucar.edu/software/udunits.
Van Rossum, G., Drake, F.L., 2003. Python Language Reference Manual. Network Theory.
Wang, D.L., Zender, C.S., Jenks, S.F., 2009. Efficient clustered server-side data analysis workflows using SWAMP. Earth Sci. Inform. 2 (3), 141–155.
Whelan, G., Kim, K., Pelton, M.A., Castleton, K.J., Laniak, G.F., Wolfe, K., Parmar, R., Babendreier, J., Galvin, M., 2014. Design of a component-based integrated environmental modeling framework. Environ. Model. Softw. 55, 1–24.
Williams, D.N., Doutriaux, C.M., Drach, R.S., McCoy, R.B., 2009. The flexible Climate Data Analysis Tools (CDAT) for multi-model climate simulation data. In: Data Mining Workshops, 2009 (ICDMW '09), IEEE International Conference on, pp. 254–261.
Woolf, A., Cramer, R., Gutierrez, M., Dam, K.K.V., Kondapalli, S., Latham, S., Lawrence, B., Lowry, R., O'Neill, K., 2005. Standards-based data interoperability in the climate sciences. Meteorol. Appl. 12 (1), 9–22.
Xie, S., Jensen, M., McCoy, R.B., Klein, S.A., Cederwall, R.T., Wiscombe, W.J., Clothiaux, E.E., Gaustad, K.L., Golaz, J.-C., Hall, S., Johnson, K.L., Lin, Y., Long, C.N., Mather, J.H., McCord, R.A., McFarlane, S.A., Palanisamy, G., Shi, Y., Turner, D.D., 2010. ARM climate modeling best estimate data. Bull. Am. Meteorol. Soc. 91 (1), 13–20.
Zender, C.S., 2008. Analysis of self-describing gridded geoscience data with netCDF Operators (NCO). Environ. Model. Softw., 1338–1342.