System dynamics simulations for data-intensive applications



Environmental Modelling & Software 96 (2017) 140–145


Environmental Modelling & Software journal homepage: www.elsevier.com/locate/envsoft

Christian Neuwirth
Department of Geography, University of Munich (LMU), Munich, Germany

Article info

Article history: Received 20 January 2017; received in revised form 29 April 2017; accepted 15 June 2017

Keywords: Simulation modeling; Data science; Uncertainty; Parameter space; Geographic space; Open source software

Abstract

Simulation modeling is increasingly perceived as a methodological asset to the field of data science. Nonetheless, adequate graphical database interfaces are missing, especially for most System Dynamics (SD) simulation tools. SimSyn is freely available middleware that links models developed in VENSIM with a PostgreSQL database in order to span SD models over geographic or multi-dimensional parameter space. The capabilities of SimSyn are demonstrated by simulating terrestrial carbon storage for 10,000 years on a 5 arc-min raster mesh with 278,115 grid cells. Results indicated reasonable performance of data-linked simulations (7,500 to 10,000 runs per 15 min) and a considerable increase in computational overhead associated with additional time series inputs. Apart from increasing performance, the incorporation of SD interoperability standards is a key objective for the further development of SimSyn. © 2017 Elsevier Ltd. All rights reserved.

Software availability

Name of software: SimSyn
Developer: Christian Neuwirth
Download: https://github.com/simsynser/SimSyn
Year first available: 2017
Software required: Vensim PLE, PostgreSQL, Windows OS
Program language: Python
Program size: 216 MB (executable file)
Availability and cost: Open source

http://dx.doi.org/10.1016/j.envsoft.2017.06.017
1364-8152/© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Ideally, sophisticated simulation tools should be able to handle large input datasets of different types. The need for such functionality is growing due to a “massive increase in the availability of informative social science data” (King, 2011) and environmental data (Lokers et al., 2016). While traditional techniques of data analysis are largely confined to variants of statistical summary, categorization and inference, data analysts may seek to add simulation techniques to their toolbox as the field of data science matures (Houghton and Siegel, 2015). In the traditionally ‘data-poor’ discipline of System Dynamics (SD) modeling (Pruyt et al., 2014), specialized tools have yet to be developed. The key objective of this ongoing work is to extend the set of data categories and scales exploitable by traditional SD process simulations. Essential data categories include time series, lookups and subscripts. The use of subscripts refers to the assignment of alternative values to model parameters in order to create different model parameterizations. This approach is typically used to show many alternative futures, to span uncertainty space (Pruyt et al., 2014) or for spatial replication of model structures.

The coupling of simulation tools to specialized database software for sophisticated digital archiving, querying and analysis of input and output data is a highly plausible approach for running this type of simulation. Up to now, this has not been fully supported by appropriate database interfaces in standard SD software. Whereas the input of time series or lookups is enabled by ODBC in the proprietary DSS version of VENSIM (Ventana Systems, 2016) or by csv file transfer in STELLA (Pierson, 2011), subscripting types are not well established. In VENSIM, for instance, subscripts need to be typed in manually, which limits large-scale applications. Unlike Vensim, software such as SIMILE or NOVA provides more sophisticated forms of model disaggregation (Muetzelfeldt and Massheder, 2003; Salter, 2013). Moreover, SIMILE has functionality for loading spreadsheet and image data as one- or multidimensional arrays. The future integration of features for collecting inputs from geographical information systems has also been announced for NOVA (Salter, 2013). Yet, the implementation of database connectivity features has not received as much attention as methods of data import from files. In contrast, simulation software like AnyLogic or Powersim supports database connectivity. Nevertheless, the creation of model instances from database data is not supported by built-in functions. Alternatively, libraries like EMA Workbench for Python (see Kwakkel and Pruyt, 2015) in conjunction with specialist database APIs can be used to control traditional SD and database software. Software like StellaR is also capable of translating conceptual SD diagrams into a more flexible scripting environment (Naimi and Voinov, 2012). Nonetheless, the high effort associated with developing and editing scripts to tightly couple simulation and database systems constitutes a major drawback of this approach.

SimSyn is a freely available graphical user interface (GUI) that links VENSIM to a PostgreSQL database. This enables efficient subscripting of VENSIM models while removing the need for time-intensive coding. Possible applications are local process simulations (i.e. models parameterized with values at spatial locations, with lateral interaction and flow neglected), comprehensive scenario testing, model calibration, and sensitivity and uncertainty analyses.

The following section describes the functionality of SimSyn in more detail. Subsequently, an application of SimSyn is demonstrated by simulating the effects of prehistoric anthropogenic land cover changes on terrestrial carbon storage. This is followed by the presentation of simulation results and results of performance testing. The article concludes with a summary and an outlook on future improvements of performance, interoperability and usability.


Fig. 3. Subscripting data link between table column ‘Col.1’ and model parameter ‘Rate’: Database values S1 to Sx are individually assigned to model parameter ‘Rate’ to create alternative model configurations.

2. The SimSyn middleware

SimSyn coordinates the interaction of VENSIM and PostgreSQL by means of a tight coupling approach (see Fig. 1). While PostgreSQL commands are invoked in code through the psycopg2 database adapter for Python, VENSIM PLE models are translated and encapsulated in Python classes using PySD (see Houghton and Siegel, 2015).
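As a rough illustration of this coupling, the following sketch reads parameter values from a database table, runs one model instance per value and writes the results back. It is not SimSyn's code: the standard-library sqlite3 module stands in for PostgreSQL and psycopg2, a hand-written decay model stands in for a Vensim model translated by PySD, and all table, column and function names are hypothetical.

```python
import sqlite3


def run_decay_model(rate, initial_stock=100.0, steps=3, dt=1.0):
    """Hypothetical one-stock model: d(stock)/dt = -rate * stock (Euler steps)."""
    stock = initial_stock
    trajectory = [stock]
    for _ in range(steps):
        stock += dt * (-rate * stock)
        trajectory.append(stock)
    return trajectory


def run_data_linked_model(conn):
    """Read one parameter value per input record, simulate, store the outputs."""
    conn.execute(
        "CREATE TABLE outputs (input_id INTEGER, step INTEGER, stock REAL)"
    )
    inputs = conn.execute("SELECT id, rate FROM inputs ORDER BY id").fetchall()
    for input_id, rate in inputs:
        trajectory = run_decay_model(rate)
        conn.executemany(
            "INSERT INTO outputs VALUES (?, ?, ?)",
            [(input_id, step, value) for step, value in enumerate(trajectory)],
        )
    conn.commit()


# In-memory database as a stand-in for the PostgreSQL input table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inputs (id INTEGER PRIMARY KEY, rate REAL)")
conn.executemany("INSERT INTO inputs VALUES (?, ?)", [(1, 0.0), (2, 0.5)])
run_data_linked_model(conn)

# Outputs stay matched to inputs through the shared key column.
final_stocks = conn.execute(
    "SELECT input_id, stock FROM outputs WHERE step = 3 ORDER BY input_id"
).fetchall()
print(final_stocks)
```

The long-format output table is chosen here for brevity and does not reproduce the output layout that SimSyn writes.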

Fig. 1. Schematic representation of system design.

Fig. 2. SimSyn interface.


In this way, components from both platforms were embedded in a GUI that enables the creation of ‘Data-Linked Models’ in three steps: (1) build a model in VENSIM and upload the data to the PostgreSQL database, (2) load the VENSIM model into SimSyn and connect the database to SimSyn, (3) define data links by matching model parameters with database columns in SimSyn. Once the data-linked model is created, the SimSyn GUI gives control over model execution, resetting and editing of data links (see Fig. 2).

SimSyn distinguishes between ‘Subscripting Data Links’ (see Fig. 3) and ‘Time Series Data Links’ (see Fig. 4). A data link is a reference between a specific model parameter and a single column of a database table. If the option ‘Subscript’ is selected in SimSyn, one alternative model parameterization is created for every record in the column. The option ‘Time Series’ is used to assign a function of time to a model parameter. A data-linked simulation may have multiple time series links to one or multiple database tables. While the number of subscripting links may also be greater than one, all subscripting links must refer to the same table. More than one subscripting table would increase the risk of inconsistent row numbers. A discrepancy in the row numbers of two or more subscripting links leads to undefined parameters, which eventually results in a runtime exception. Nevertheless, a data-linked simulation may be based on combinations of subscripting data links and time series data links.

Fig. 4. Time series data link between table column ‘Col.1’ and model parameter ‘Rate’: Database values t1 to tx are assigned as a function of time to model parameter ‘Rate’.

As an additional functionality, subscripting links may optionally be equipped with a simulation timestamp, i.e. the data value is assigned to a specific point in time in the dynamic simulation model. If multiple ‘Time-Dependent Subscripting Links’ are set, SimSyn linearly interpolates between the given values in the course of its integration (Houghton and Siegel, 2015). Time series data is also interpolated by the time step defined in the Vensim model.

Results of simulation runs are written to a new database table. The structure of the output table depends on data inputs (see Fig. 5). If subscripting links are involved, one array of discrete-time outputs is created for every simulation run and one column is added to the table for every state variable (system stock) of the SD model. In contrast to subscripted simulations, simulation outputs based on time series data only are stored as simple scalars. Simulation time steps are indicated by table row numbers. The number of table columns in turn corresponds to the number of state variables in the model.

Fig. 5. A) Left: Values of ‘Col.1’ are assigned as subscripts S{subscript number} to a model parameter; A) Right: Outputs of subscripted simulations are structured as t{simulation time step number}{subscript number}; B) Left: Values of ‘Col.1’ are assigned as time inputs t{time step number} to a model parameter; B) Right: Outputs of the time series simulation are structured as values over time V{simulation time step number}.

Fig. 6. Left: Time step 9000 (1000 BP) is retrieved from the SimSyn output table (retrieved value is in red), variable indices are structured as t{simulation time step number}{subscript number}; Right: Map visualization of retrieved value.1 (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

In order to enable efficient querying of database datasets, table records are matched by key columns in input and output tables. Additionally, input and output table records can be matched by spatial location where spatial input data is used. The presented database structure enables numerous data querying operations. The most common operations, recommendations on data export and system setup, as well as advice on error handling, are summarized on the GitHub Wiki (https://github.com/simsynser/SimSyn/wiki).
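The handling of time-dependent inputs in this section can be illustrated with a minimal sketch of linear interpolation during integration. It assumes a hypothetical one-stock model whose net inflow is the linked parameter 'Rate', and it holds the first and last values constant outside the sampled range; both are assumptions made for illustration, not documented SimSyn behavior. Output rows are indexed by simulation time step, mirroring the time series case in which results are stored as values over time.

```python
from bisect import bisect_right


def interpolate(times, values, t):
    """Piecewise-linear lookup; constant outside the sampled range (assumption)."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)  # index of the first sample time greater than t
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def simulate(rate_times, rate_values, initial_stock=100.0, final_time=4.0, dt=1.0):
    """Euler integration of d(stock)/dt = Rate(t) for a hypothetical stock."""
    t = 0.0
    stock = initial_stock
    rows = [(t, stock)]  # one output row per simulation time step
    while t < final_time:
        rate = interpolate(rate_times, rate_values, t)
        stock += dt * rate
        t += dt
        rows.append((t, stock))
    return rows


# Two database values (at t = 0 and t = 4) drive a model with time step 1.
rows = simulate(rate_times=[0, 4], rate_values=[0.0, 8.0])
for t, stock in rows:
    print(t, stock)
```

With only two sampled values, the model still receives a distinct 'Rate' at every time step, which is the practical benefit of interpolating coarse input series to the model's own time step.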

3. Use case

The use of SimSyn is demonstrated by modeling carbon in biomass from 11,000 BP (before present) to 1,000 BP at a 5 arc-min grid resolution. The study area extends over Eastern Europe, Southeastern Europe and the Middle East and comprises 278,115 grid cells in total. This region was selected because of its early adoption of agricultural practices, the associated vegetation clearing and thus the anthropogenic depletion of carbon sinks. The primary driver of carbon stock depletion in the model was a gridded time series of Anthropogenic Land Cover Change (ALCC) provided with the HYDE 3.1 database (Klein Goldewijk et al., 2011). The eleven layers contained in the database cover prehistoric and historic ALCC ratios with a 1000-year temporal resolution. The structure of the SD model itself is a simplified version of the CO2 dynamics model presented in Bossel (2004) and includes processes of carbon absorption due to net primary production as well as carbon loss as a consequence of plant litter and animal respiration.

The subscripting function of SimSyn was used to span the carbon model over the 2-dimensional ALCC meshes. One more subscripting data link was created to initialize the model parameter ‘latitude’ from a 5 arc-min latitudinal grid. Furthermore, a fictitious variation of orbital geometries and solar radiation over time is initialized from the database by assigning a 10,000-year time series to the model parameter ‘solar factor’. Overall, the model is assigned twelve spatial grids and one time series. A more detailed model description is provided together with differential equations and model structure on GitHub.

1 Results of the carbon model are intended to highlight application potentials of SimSyn. Outcomes may not be interpreted quantitatively due to a lack of model calibration and testing.

Fig. 7. A) A subset of data from the simulation output table (right) is queried by attribute ‘ALCC 4000’ (queried records are in red) in the input table (left), variable indices are structured as t{simulation time step number}{subscript number} and S{subscript number}; B) Visualization of retrieved outputs.1 (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


4. SimSyn outputs

The simulation produced an output table with roughly 2.7 billion values in three-dimensional data space. This table may be queried for visualization by temporal or subscript attributes. In this example, as subscript attributes correspond to spatial location (grid cells), retrieving one specific time instance of every data row yields a map that represents a snapshot of relative carbon storage (see Fig. 6). Alternatively, fluctuations of relative carbon at spatial locations may be presented as a function of time (see Fig. 7).

5. Performance testing

SimSyn was developed as a tool for spanning data space by independently operating simulation runs. In order to evaluate software capabilities and limitations in terms of simulation scale, four different performance tests were conducted:

Test Scenario A: One subscripting link was defined to span the carbon model (see section 3) over an oscillating factor m (represents a number of fictitious orbital geometries).

Test Scenario B: Two subscripting links were defined to span the carbon model over the spatial layers ‘ALCC at 1,000BP’ and ‘latitude’.

Test Scenario C (equivalent to use case run): Twelve subscripting links were defined to span the carbon model over the spatial layers ‘ALCC 11,000BP to 1,000BP’ and ‘latitude’.

Test Scenario D: Twelve subscripting links were defined to span the carbon model over the spatial layers ‘ALCC 11,000BP to 1,000BP’ and ‘latitude’. Moreover, every simulation run was linked to a time series of an oscillating factor m over 10,000 time steps (fictitious dynamics in orbital geometry).

Scenarios A to D were run on a DELL Latitude E6440 (i7-4610M CPU, 8 GB RAM, Windows 7 64-bit). Results indicated a relation close to linear between computation time per simulation run and number of subscripting links for small numbers of subscripting links (compare scenarios A and B in Fig. 8). However, the same relation scaled sub-linearly with a significant increase in the number of subscripting links (compare scenarios A, B with C, D in Fig. 8). In other words, the simulation becomes more efficient as the number of subscripting links involved in the simulation increases. This can be explained by the nature of the data writing operation. While the computational overhead of data reading is related to the number of selected data links, data writing is independent of the number of links. Moreover, data writing exceeds the computation time of data reading by a factor of about 100. This implies that the relative overhead decreases significantly with increasing numbers of data links. In addition, considerable computational overhead was imposed once time series were involved in subscripted simulations (compare scenarios C and D in Fig. 8). The variance of computation time per simulation run also increases with larger numbers of subscripting links involved in the simulation (compare scenarios A, B with C, D in Fig. 8).

Fig. 8. Computation time required for model parameterization from database, model execution and writing outputs to database (n = 2321) in respective scenarios A, B, C and D.


6. Conclusion and outlook

The key objective of this ongoing endeavor is to foster data-driven simulation modeling in order to enable more realistic process representations. The middleware presented in this article, called SimSyn, constitutes a first step towards more elaborate data integration in a process modeling context. SimSyn provides the capabilities and an easy-to-use GUI to link SD models to database tables and to configure model parameters from database data. The subscripting functionality of SimSyn is capable of spanning multidimensional data spaces, which can be used to run comprehensive sensitivity tests and local process simulations.

The use of SimSyn has been demonstrated by running a carbon simulation in VENSIM for 10,000 years on a 5 arc-min raster mesh with 278,115 grid cells. Depending on the data links included in the performance test scenario (see section 5), a single run takes from 0.06 up to 0.29 s on a conventional notebook. Computational overhead increased moderately with an increasing number of subscripting links. About 10,000 runs could be performed in 15 min if only one subscripting link was defined in the simulation. Approximately 7500 runs could be realized in the same period of time if twelve subscripting links were included. Considerable performance loss is associated with additional time series input in subscripted simulations. One additional time series (10,000 time steps) in a scenario with twelve subscripting links resulted in 5300 runs executed in 15 min. The long execution times in this scenario can be traced back to the additional time required for loading the entire time series in every individual simulation instance. Data compression is considered a promising solution to existing performance constraints. In the medium term, SimSyn will be complemented with a compression algorithm that makes use of similarities and sparse matrix characteristics of data inputs.
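The reported throughput figures translate into mean per-run times and whole-grid run times by simple arithmetic. The sketch below performs this conversion; the whole-grid estimates assume that runs execute sequentially at the reported rate, which the article does not state explicitly, and the scenario labels are shorthand introduced here.

```python
WINDOW_SECONDS = 15 * 60   # the 15 min reporting window
GRID_CELLS = 278_115       # grid cells of the use case

# Runs completed per 15 min, as reported in the text.
RUNS_PER_WINDOW = {
    "1 subscripting link": 10_000,
    "12 subscripting links": 7_500,
    "12 subscripting links + 1 time series": 5_300,
}


def seconds_per_run(runs_per_window):
    """Mean wall-clock time of a single run."""
    return WINDOW_SECONDS / runs_per_window


def hours_for_grid(runs_per_window, cells=GRID_CELLS):
    """Time to run every grid cell once, assuming sequential execution."""
    return cells * seconds_per_run(runs_per_window) / 3600


for label, runs in RUNS_PER_WINDOW.items():
    print(
        f"{label}: {seconds_per_run(runs):.2f} s per run, "
        f"{hours_for_grid(runs):.1f} h for the full grid"
    )
```

The resulting mean per-run times all fall inside the 0.06 to 0.29 s range quoted for single runs.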
Further potential improvements mainly relate to interoperability and usability. At the moment the applicability of SimSyn is restricted to models developed in Vensim. A more generic, platform-independent middleware design, based on the XMILE standard for system dynamics (Eberlein and Chichakly, 2013), is planned for future versions of SimSyn. Moreover, the functionality will be expanded continuously. Highest priority will be given to data visualization and to the development of a complementary data link for assigning time series as subscripts.

References

Bossel, H., 2004. Systemzoo 2: Klima, Ökosysteme und Ressourcen. BoD - Books on Demand.

Eberlein, R.L., Chichakly, K.J., 2013. XMILE: a new standard for system dynamics. Syst. Dyn. Rev. 29, 188–195.

Houghton, J., Siegel, M., 2015. Advanced data analytics for system dynamics models using PySD. In: Proceedings of the International Conference of the System Dynamics Society. Presented at the International Conference of the System Dynamics Society, Boston.

King, G., 2011. Ensuring the data-rich future of the social sciences. Science 331, 719–721.

Klein Goldewijk, K., Beusen, A., Van Drecht, G., De Vos, M., 2011. The HYDE 3.1 spatially explicit database of human-induced global land-use change over the past 12,000 years. Glob. Ecol. Biogeogr. 20, 73–86.

Kwakkel, J.H., Pruyt, E., 2015. Using system dynamics for grand challenges: the ESDMA approach. Syst. Res. Behav. Sci. 32, 358–375.

Lokers, R., Knapen, R., Janssen, S., van Randen, Y., Jansen, J., 2016. Analysis of Big Data technologies for use in agro-environmental science. Environ. Model. Softw. 84, 494–504.

Muetzelfeldt, R., Massheder, J., 2003. The Simile visual modelling environment. Eur. J. Agron. 18, 345–358.

Naimi, B., Voinov, A., 2012. StellaR: a software to translate Stella models into R open-source environment. Environ. Model. Softw. 38, 117–118.

Pierson, N., 2011. Connecting IThink and STELLA to a Database [WWW Document]. URL: http://blog.iseesystems.com/stella-ithink/connecting-ithink-and-stella-toa-database (Accessed 23 December 2017).

Pruyt, E., Cunningham, S., Kwakkel, J., De Bruijn, J., 2014. From data-poor to data-rich: system dynamics in the era of big data. In: Proceedings of the International Conference of the System Dynamics Society. Presented at the International Conference of the System Dynamics Society, Delft.

Salter, R.M., 2013. Nova: a modern platform for system dynamics, spatial, and agent-based modeling. Procedia Comput. Sci. 18, 1784–1793.

Ventana Systems, 2016. Vensim Help. Connecting to Databases with ODBC [WWW Document]. URL: https://www.vensim.com/documentation/index.html?users_guide.htm (Accessed 23 December 2016).