Computers and Electronics in Agriculture 161 (2019) 14–28
Towards integration of data-driven agronomic experiments with data provenance
Sérgio Manuel Serra da Cruz a,⁎, José Antonio Pires do Nascimento b
a Federal Rural University of Rio de Janeiro (UFRRJ), Department of Computer Sciences, BR-465, Km 7 – Room 80 – P1, CEP 23.897-000, Seropédica, RJ, Brazil
b Brazilian Agricultural Research Corporation (Embrapa), BR-465, Km 7 – Bairro Ecologia, CEP 23891-000, Seropédica, RJ, Brazil
ARTICLE INFO
Keywords: Agriculture; Provenance; Scientific workflows; Reproducible research; Scripts; Data analysis
ABSTRACT
With improvements in computing and communications, the amount of scientific data in agriculture has been exploding. Researchers thus rely on computational simulations to model data-driven in silico agronomic experiments, i.e., experiments that are executed entirely through computer models. Reproducibility, transparency, and independent verification are major features of Science. However, even agricultural research of exemplary quality may have irreproducible empirical findings because of random or systematic error. Funding agencies, researchers, and reviewers are demanding improved processes and the use of open data to increase the reproducibility of those experiments. Currently, there are no scientific criteria to evaluate the integration of data-driven agronomic experiments with data provenance. We propose RFlow, a framework that aids researchers in managing, sharing, and enacting the in silico experiments of research projects that use reusable R scripts. The framework uses open data standards and transparently captures the provenance of the agronomic experiments. RFlow is non-intrusive, can be connected to workflow systems, and does not require researchers to change their way of working. Our computational experiments show that the framework can collect provenance metadata and enrich a scientific project. This study shows how RFlow can serve as the primary integration platform for statistical systems like R, with implications for other data- and compute-intensive agronomic projects. As a proof of concept, we show the concrete effectiveness and expressive power of RFlow, which we evaluated through a set of data-driven agronomic in silico experiments and provenance SQL queries that exemplify the kind of information gathered.
1. Introduction

Feeding the 9 billion individuals expected to inhabit our planet by 2050 will be an extraordinary challenge. According to Godfray et al. (2010), agriculture has little time to be transformed; researchers should work out how to grow more food without exacerbating environmental problems, while coping with climate change and making rational use of natural resources. This combination of factors poses novel and complex challenges for global agriculture. Such concerns demand the integration of research fields like Agricultural Science and Computer Science (Tuot et al., 2008; Cho et al., 2010). The use of data-intensive technologies such as the Internet of Things, precision agriculture, sensors, robots, drones, and driverless tractors in the fields, together with modern experimental apparatuses in laboratories, demands new research perspectives to ensure the reproducibility of experiments. Such approaches require the use of modern data
management, data curation, and data preservation techniques (Dou et al., 2012; Katz and Zhang, 2014; Driemeier et al., 2014). Deep intellectual contributions are now encoded as scripts. Besides, datasets are key assets in data-driven research projects in any field of endeavor. However, most of these elements are lost when the research is poorly documented as disconnected computer files (van Evert et al., 2008; Gabriel and Capone, 2011). Currently, researchers frequently start thinking about making their research projects reproducible close to the end of their investigations, when they are submitting their manuscripts. Others consider it even later, when a journal requires that their datasets and scripts be made available for publication (Fomel and Claerbout, 2009; Gandrud, 2015; Pasquier et al., 2017). Furthermore, most researchers do not deposit the datasets, scripts, and software packages related to their research projects in open repositories; when they do, these are often deposited on personal or institutional websites, lacking consistency, discoverability, and curation. Reproducibility, provenance, reliability, and independent
⁎ Corresponding author.
E-mail addresses: [email protected] (S.M.S.d. Cruz), [email protected] (J.A.P.d. Nascimento).
https://doi.org/10.1016/j.compag.2019.01.044
Received 2 December 2017; Received in revised form 7 September 2018; Accepted 26 January 2019; Available online 28 February 2019
0168-1699/ © 2019 Elsevier B.V. All rights reserved.
verification are cornerstones of Science. Tracking provenance for agronomic research is essential. According to Zhou and Talburt (2015), high-quality data are fundamental; they form the foundation for sound scientific decision-making, wise management, and rational use of experimental resources. The provenance of data is metadata (Buneman et al., 2000; Simmhan et al., 2005; Freire et al., 2008). Metadata is a set of data that gives information about other data. Provenance captures the derivation history of a dataset, including the original and intermediate datasets and the activities of an experiment, and enriches scientific data (Buneman et al., 2000). The difficulties in leveraging data-driven research projects in agriculture are many (Borgman, 2012). The main issues are: (a) the lack of scientific data citation standards and of suitable associations between research articles and scientific data; (b) the lack of incentives to share research data; (c) the lack of appropriate infrastructure to store and preserve data; (d) the difficulty of sharing datasets; and (e) the difficulty of defining “data” and “experiment” in any given research endeavor. According to several authors, such as Stodden et al. (2014), Michener (2015), Buneman et al. (2016) and Yeumo et al. (2017), the overwhelming amounts of data are making in silico experimentation hard to reproduce, cite, or publish if disconnected from its original datasets, scripts, and protocols. The expression in silico is not new. It is quite common in the e-Science domain, but not in Agriculture and Agronomy. It characterizes studies that are completely executed using computer models, meaning that the environment, object, and subject behavior are represented by computational models that simulate their relevant features. In this case, the environment is entirely composed of numeric models to which no human interaction is allowed (Travassos and Barros, 2004). In silico experimentation, also known as computational research, provides researchers with many significant advantages, such as higher precision and improved quality of experimental data, better support for data-intensive research, and access to massive sets of experimental data generated by scientific communities. Despite these advantages, in silico experimentation suffers from the increased complexity of setting up, managing, maintaining, and changing the experimental simulation. Thus, interacting with such a system involves a significant amount of purely computational aspects that are sometimes hard to set up by scientists with little background in computer science.
1.1. Research problem

Managing and integrating data-driven scientific projects is a complex task (Pasquier et al., 2017). Currently, the increasing complexity of the datasets of in silico experiments is raising alarms about scientific results that cannot be reproduced in several scientific domains, including agriculture. Besides, a growing number of retractions, a few linked to fraud, have helped stir concerns about irreproducible results in many research fields (Ioannidis, 2005; Baker, 2016; Mervis, 2016). All this has contributed to the systematic increase in the use of provenance metadata in data-driven scientific projects (Freire et al., 2008; Hey et al., 2009; Mitchell et al., 2012; Sandve et al., 2013; Cruz and Nascimento, 2016; Buneman et al., 2016; Cruz et al., 2018a). As researchers need to compute increasingly large volumes of data, the use of scientific workflows focusing on the analysis of variance and experimental design of agronomic experiments is witnessing a rapid increase (Morales et al., 2013; Mullis et al., 2014). Scientific workflows play a crucial role in much of today's Science; they are widely recognized as an important paradigm in the computing arena (Katz and Zhang, 2014). Briefly, scientific workflows can be thought of as data pipelines consisting of data transformation activities connected to one another to produce data products (Kashlev and Lu, 2014). According to Kohler (2002), modern agronomic experiments are composed of two complementary practices (long-term field experiments and short-term laboratory experiments). The author stresses that such demarcation is not clear-cut and that the practices are separated by a boundary where ‘laboratory and field practices can meet and mingle’. The first practice is based on well-defined experimental research methods that are time- and labor-intensive, using on-the-ground and field-scale experiments which may last for years to collect raw data. The second practice uses scientific programs, workflows, and statistical analysis to compute the collected data. These analyses call for a systematization of information and usually require computational effort as a supporting act to the experimental research done in the fields (Parolini, 2015). However, most of the computational effort is centered on poorly documented data repositories (e.g., spreadsheets, text files, relational tables, or unstructured annotations) and is hidden in the scripts that produce the scientific results. These scripts are hand-coded by researchers in textual programming languages such as Bash, Perl, R, and Python. The scripts often have extended syntax and complex semantics; they are not easy to share, validate, curate, or maintain by unskilled users. Last but not least, these scripts are seldom documented and rarely take advantage of existing mechanisms that collect provenance (McPhillips et al., 2015; Murta et al., 2015). The importance of provenance in reproducible computational research is well documented in the literature. The provenance of data-driven in silico scientific experiments enables researchers to analyze the quality, verify the authorship, and reproduce the achieved results. According to Zhou and Talburt (2015), data standards (e.g., the ISO 8000 family of standards), data definitions, data requirements, data quality information, data provenance, and business rules are all forms of metadata. Due to the amplitude of the subject “data quality”, this work focuses on one type of metadata: data provenance. Data provenance allows researchers or referees to obtain accurate evidence about when, where, and how activities were executed, by whom, and which computations were applied to the datasets (Bowers et al., 2008; Aldeco-Pérez and Moreau, 2010). Like many authors, we stress that the use of open data together with provenance is a key element in addressing the reproducibility of in silico experiments (Tuot et al., 2008; Mesirov, 2010; Brooks et al., 2011; Cruz and Nascimento, 2016). Those elements may aid scientists in interpreting results, proposing new hypotheses, and validating agronomic practices, which produce lots of data whose value and quality are then easier to evaluate.

1.2. Contribution
The motivation for our study is to bring a novel approach that helps researchers expand the computational reproducibility of in silico data-driven agronomic experiments. Hence, we consider the reuse of scripts and provenance management to attain reproducibility. Besides, we developed a framework that uses workflows to preserve the scripts from modifications and to collect two kinds of provenance of the experiments, saving this knowledge together with the experiments' data in repositories compatible with the W3C PROV recommendation (Moreau and Groth, 2013). Accordingly, in this paper, we seek to expand the understanding of data-driven agricultural studies. We present RFlow, a framework which can be used to manage the data of agronomic research projects and their in silico experiments. RFlow is supported by scientific workflows that wrap and reuse existing R scripts. Besides, it aims to mitigate the restrictions of current statistical systems regarding the transparent and non-intrusive generation of provenance metadata. Also, it may enhance the reproducibility of these experiments, aiding researchers to share more trusted research data. We highlight the main contributions of this paper. First, we develop an unobtrusive, automated approach for capturing prospective and retrospective provenance within statistical workflows. Second, we describe the RFlow framework, which allows researchers to (re)use R scripts encapsulated by scientific meta-workflows and permits the sharing of agronomic experiments' results.
Furthermore, we provide a motivating case study, presenting the tool's usage in a scenario at Embrapa (EMBRAPA, 2017), which suggests that our approach is both efficient and useful in researchers' daily duties. Following this introduction, the remainder of this paper is structured as follows. Section 2 provides background about the key topics in data-centric research. Section 3 presents RFlow. Section 4 presents a case study and experiments. Section 5 discusses related works. Finally, Section 6 concludes the paper.

2. Background

A few decades ago, the results of an agronomic research project at the frontier of agricultural science could be summarized in a handful of numbers, and the published paper of such an experiment contained the data necessary to reanalyze the results (White and van Evert, 2008). However, in today's world, newer methodologies such as in silico experiments and scientific workflows not only permit reviewing results from past agronomic experiments in novel ways but also generate new results that may comprise many gigabytes of binary data. These trends increase the value of data from agricultural research and make trusted data an essential asset for the entire agricultural community.
Fig. 1. Conceptual representation of the in silico scientific experiment life cycle.
2.1. Agronomic experiments

An agronomic experiment is usually associated with a scientific method for testing certain agricultural phenomena (Matt, 2011). The design of agronomic experiments may vary regarding goals, practices, species, apparatus, and duration in time (Kohler, 2002). Researchers may conduct trials which are characterized as long-term experiments (LTE) and short-term experiments (STE). LTE are traditional in agriculture; they have been running in the fields in various parts of the world for the last 175 years (e.g., the Rothamsted Experimental Station in England) and still need more time to execute the research procedures. The main advantage of LTE is that they allow the quantification of the impacts of management practices on soil processes which may change slowly (e.g., pH, soil organic matter, microbial diversity) but are essential in terms of sustainable agriculture. A disadvantage of LTE is that, because management practices change over time, they do not always reflect current farming practice. Moreover, because they have no defined ‘end’ to the project, many results have not been adequately published in papers or disseminated in a form suitable for extension. The major disadvantage is their cost relative to their perceived returns (Crawford et al., 2003). On the other hand, STE can be performed in a few weeks, months, or years and have the potential to contribute to improving LTE. STE can be executed either in the fields (e.g., 1–5 years) or in wet or dry laboratories, in which the trials can be characterized as in vitro and in silico experiments, respectively (Crawford et al., 2003). A significant difference between STE and LTE is that the endpoint is quite clear in STE, whereas it is not in ongoing experiments. The endpoint provides a trigger for preparing the manuscripts or documenting the experiment, and this endpoint does not exist in LTE. Furthermore, STE generate more data in small periods of time when compared with LTE. Thus, it is essential to deliver to the agricultural community a novel computing infrastructure that can share raw and curated data and the provenance of agronomic STE and LTE and augment the reproducibility of the experiments.

2.2. In silico scientific experiment life cycle

According to Mattoso et al. (2010), an in silico scientific experiment has a life cycle which consists of multiple phases and interactions carried out by researchers. In Fig. 1 we present a simplified version of the life cycle; the main phases can be identified as composition, execution, and analysis. The first phase is the composition phase; researchers
define the protocol and the structure of the scientific experiment at a high level of abstraction. This definition should allow them to make choices of methods to support their experiment's hypothesis, i.e., the logical sequence of activities of a scientific workflow, the types of input/output data, and the parameters that must be provided. Then, during the execution phase, scientists should be able to configure concrete scientific workflows that correspond to their in silico experiment composition. In this second phase, the concrete workflows are implemented and executed; the activities (i.e., programs) are chosen and chained to represent the methodology defined in the experiment. The workflows can be executed by a Scientific Workflow Management System (SWfMS) in a particular computational environment, generating datasets and results. Finally, the analysis phase supports the interpretation and evaluation of the hypotheses and data. This phase is highly dependent on the data and the configuration of the experiment generated in the previous phases. Besides, the scientist must consider the decisions made during the composition and execution of different workflow trials of the STE to prepare manuscripts. In data-driven in silico experiments, supporting data management, data curation, and data preservation across these phases becomes fundamental. Thus, provenance must be collected during all three phases of the in silico experiment. Data provenance is a formal representation of computational processes (Freire et al., 2008); it records the history of the computational processes of the entire experiment, from its specification up to the execution times of each process. With these metadata, scientists can audit executions and augment the reproducibility of the STE. Nowadays, many data-driven agronomic and agricultural STE are not able to collect provenance metadata. However, had they adopted the in silico scientific experiment life cycle, after the analysis phase the scientist would be able to publish not only the findings but also the dataset and the provenance of the whole experiment. Such knowledge could be published and shared in a peer-reviewed journal (dotted line in Fig. 1). The usefulness of provenance in Science is greater than ever. Producers, scientists, referees, and funding agencies are seeking precise and trustworthy information. Data quality is central to scientific practice in any area. Providing full details about one's work allows other researchers to verify, replicate, and extend that investigation (White and van Evert, 2008). Many agronomic and agricultural researchers are willing to share their datasets. However, there is limited incentive to do so. First, organizing and documenting datasets is time-consuming and costly, especially when data reside on aged, obsolete media. Second, much less recognition is received for making a dataset available than for publishing research findings in a peer-reviewed publication. Finally, there are very few works regarding the use of open data and provenance metadata in agriculture (White and van Evert, 2008; Cruz et al., 2009; Cruz and Nascimento, 2016).
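To make the three phases more concrete, the sketch below shows the kind of record a researcher might keep for each phase of an in silico experiment. It is an illustration only; the structure and field names are our own assumptions, not RFlow's schema.

# Illustrative life-cycle record (hypothetical field names, not RFlow's schema).
experiment <- list(
  composition = list(                       # phase 1: the plan (prospective provenance)
    hypothesis = "N fertilization raises soil %N",
    script     = "anova_soil.R",            # assumed script name
    inputs     = "soil_samples.csv",        # assumed input file
    parameters = list(design = "completely randomized", alpha = 0.05)
  ),
  execution = list(                         # phase 2: the run (retrospective provenance)
    started = Sys.time(), finished = NA, status = "pending"
  ),
  analysis = list(                          # phase 3: interpretation and publication
    findings = NA, publication = NA
  )
)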
2.3. Open data in agriculture

Open Data (OD) and open-source software have enabled the fast implementation of novel computational methods to manage and analyze the growing amount of scientific data. OD is defined as data that anyone can access, use, or share (Open Definition, 2017; OKFN, 2017). The benefits of OD are diverse and range from improved efficiency of governments, businesses, and individuals to wider social welfare. Researchers, too, can use OD to enhance their investigations. OD has recently been gaining traction in agriculture. The key OD principles can be summarized as follows (OKFN, 2017): (a) Availability and Access: the data must be available as a whole and at a reasonable cost, preferably by downloading over the Web; the data must also be available in a convenient and modifiable format; (b) Reuse and Redistribution: the data must be provided under terms that permit redistribution and reuse, including intermixing with other datasets; (c) Universal Participation: everyone must be able to use, reuse, and redistribute the data, with no discrimination against fields of knowledge, persons, or groups. Agricultural and agronomic research uses and produces many datasets while investigating agricultural systems. However, few of these datasets follow the OD principles (Yeumo et al., 2017; CGIAR, 2017; AIMS, 2017; GODAN, 2017). OD overcomes the boundaries of traditional information distribution. In the agricultural sector, efforts are being made to exploit the potential of this approach, in particular regarding global food security and international research collaboration. Currently, funding agencies like the National Science Foundation (NSF, 2017) and agricultural research agencies like the Consultative Group for International Agricultural Research (CGIAR, 2017), Agricultural Information Management Standards (AIMS, 2017), Global Open Data for Agriculture and Nutrition (GODAN, 2017), and the International Council for Science's Committee on Data for Science and Technology (CODATA, 2017) are encouraging the use of OD in agricultural and agronomic research. These agencies advocate that OD provides better resources and support for authors wishing to support reproducible research. In our view, the use of OD in research provides several benefits: (a) allowing greater access to datasets and opening opportunities to build upon and create novel and innovative research from publicly accessible data; (b) working smarter and faster by sharing and analyzing data in real time; (c) allowing in silico experiments to be reproduced and verified by anyone; (d) increasing researcher transparency and reducing academic fraud; (e) maximizing data reuse by selecting formats for efficient data extraction; and (f) ensuring compliance with funding agencies' directives and journal publishing policies.
2.4. Scientific workflows and the Kepler system

By definition, a scientific workflow is the automation of a scientific process, during which logical activities process datasets according to a set of user-defined rules (Yu and Buyya, 2005; Liu et al., 2014). The workflow defines the experiment; it describes all the computational resources required to execute the activities. The facilities offered by scientific workflows include rapid and easy design, reuse and documentation, scalable execution, sharing and collaboration, and other benefits that altogether enable reproducible computational research (Deelman et al., 2009; Prabhu et al., 2011; LeVeque et al., 2012; Sandve et al., 2013). A Scientific Workflow Management System (SWfMS) is a framework for scientific workflow enactment. It is an efficient tool to execute workflows and manage datasets in various computing environments (Kashlev and Lu, 2014). It is important to stress that SWfMS allow researchers to carry out a kind of high-level programming through activities that follow a certain logic (Callahan et al., 2006; Oinn et al., 2007; Mullis et al., 2014). Currently, there are dozens of SWfMS providing facilities for the implementation of applications in a variety of scientific domains. SWfMS are general-purpose tools that can be executed either on a standalone workstation or in distributed environments (Liu et al., 2014). Among them, we highlight the Kepler system (Ludäscher et al., 2006; McPhillips et al., 2009). Kepler allows researchers to execute in silico experiments in different research domains. Furthermore, it offers built-in components to support provenance gathering and connection to the R system. Kepler was designed to help researchers conceive, execute, and analyze data across a broad range of scientific disciplines. Kepler allows the researcher to plug different execution models into workflows; it has an actor-oriented modeling paradigm whose actors run as local Java threads but can be extended to spawn distributed execution threads via the Web; the RFlow framework exploits such technical features. Besides, Kepler contains libraries of reusable actors (activities) that perform several computations, such as statistical analyses, that can be easily integrated with the R system. Currently, Kepler offers several runtime provenance engines to capture and track one kind of provenance metadata (i.e., retrospective provenance) of the workflow executions (Altintas et al., 2006). However, they often rely on proprietary formats that make the interchange of provenance information difficult.
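As a rough illustration of this data-pipeline view (a minimal sketch of our own, not a Kepler workflow; the file and column names are assumptions), a chain of transformation activities could be written in R as:

# Minimal sketch of a workflow as a data pipeline (not a Kepler workflow).
# Each activity is one data transformation; the workflow chains them.
read_activity  <- function(path) read.csv(path)                  # assumed CSV input
clean_activity <- function(d) d[complete.cases(d), ]             # drop incomplete rows
model_activity <- function(d) aov(yield ~ treatment, data = d)   # assumed column names
workflow <- function(path) model_activity(clean_activity(read_activity(path)))
# summary(workflow("field_trial.csv"))   # "field_trial.csv" is a hypothetical file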
2.5. Data provenance

Provenance is defined as metadata records that describe the processes, people, entities, institutions, and activities involved in producing, influencing, or delivering a piece of data or a thing (Buneman et al., 2000; Davidson and Freire, 2008; Moreau and Groth, 2013). To Freire et al. (2008), workflow provenance may be as (or more) important than the in silico scientific experiment itself. In recent years, the usage of provenance has extended to other areas such as open data (Hartig and Zhao, 2010), security and privacy (Martib et al., 2012), reproducible research (LeVeque et al., 2012), and business processes (Cruz et al., 2013), among others. However, we stress that there are few provenance-based papers in agricultural and agronomic research (Cruz and Nascimento, 2016; Cruz et al., 2018a). Provenance can be classified into two types: prospective and retrospective (Davidson and Freire, 2008; Freire et al., 2008). Prospective provenance comprises an abstract process specification as a recipe for future data derivation; it is, in general, independent of a model and is intended to capture the recipe in an abstract and informative form to allow further querying of this information. Retrospective provenance consists of a structured and detailed history of the execution of a workflow and data derivation information, i.e., which activities were performed and how data were derived. Provenance metadata can be automatically captured during a workflow execution by the SWfMS or by third-party provenance systems (Cruz et al., 2009). In this work, we adopted the W3C PROV recommendation to store, trace, share, and exchange provenance metadata (Moreau and Groth, 2013). The goal of the PROV recommendation is to enable the publication and interchange of provenance on information systems. It enables one to represent and interchange provenance metadata using available formats such as XML and RDF. The recommendation has three core concepts: Entity, Agent, and Activity. These concepts enable users to map full traces of workflow runs in W3C PROV format. This fosters the delivery of a final research product to institutional archives, where provenance is included in their data-curation processes. The recommendation allowed us to develop a standardized relational data model for storing and querying the prospective and retrospective provenance collected during the life cycle of the in silico experiments executed in RFlow. Although a detailed discussion about data quality and its categories (e.g., master data, transaction data, metadata, and reference data (Zhou
and Talburt, 2015)) is beyond the scope of this work, there are connections between data quality standards and data provenance. For example, ISO 8000 is a series of standards that address data quality. The standard ISO 8000-120:2016 specifies requirements for the capture and exchange of data provenance information (ISO, 2018). However, it specifies neither a complete model for characteristic data nor an exchange format for characteristic data with provenance information. Thus, to trace the changing information and the provenance of agronomic in silico experiments over the Web, we adopted the W3C PROV recommendation to implement and codify our framework.

2.6. Scripting and the R system

The use of scripts is widespread in agricultural science, where research teams develop proprietary scripts. These scripts are rarely shared; in several cases they become nearly invisible to other researchers and are therefore likely to remain underutilized or be eventually lost. Finally, they are seldom distributed together with the publications (Heidorn, 2008; Wallis et al., 2013). The R system is an open-source statistical software that offers a broad range of statistical features. Several features make R attractive to agronomic researchers; for instance, it encourages data exploration by providing an online environment that supports a combination of executing scripts and entering commands in the interactive R console (Spector, 2008; Chambers, 2008; Lerner and Boose, 2015). However, despite its potential, the system was designed by statisticians, for statisticians. Thus, to create R scripts, programming and statistical skills are desirable. Besides, the system still supports neither provenance gathering nor execution tracking (Runnalls, 2013).
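For illustration (the file, variable, and model names below are our own assumptions), a typical exploratory session of the kind just described leaves no provenance trail beyond the script text itself:

# Hypothetical exploratory R session: nothing here records who ran it, when,
# on which version of the input file, or with which package versions.
soil <- read.csv("soil_plots.csv")        # assumed input file
summary(soil$pH)                          # quick exploration at the console
fit <- lm(pH ~ treatment, data = soil)    # assumed columns
plot(fit)                                 # diagnostic plots, not archived anywhere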
3. RFlow framework

Reproducible research is a fundamental responsibility of researchers, but the best practices for achieving it are not well established in many data-driven agronomic research projects. Our contribution differs from the related works and is depicted in Fig. 2. We propose RFlow, an architecture which supports the whole life cycle of in silico experiments, including input and output datasets and R scripts, saving all this information in a repository compatible with the W3C PROV recommendation. Besides, our approach does not require any changes in the source code of the R scripts. Finally, research datasets and provenance metadata can be made publicly available as OD to researchers or third parties for purposes such as visualization, browsing, querying, and auditing through a Web interface named SisGExp (Nascimento and Cruz, 2013). The rationale of RFlow is to isolate the researcher from time-consuming (re)codifications of R scripts, allowing them to focus on the research itself. RFlow transparently encapsulates R scripts as generic statistical scientific workflows (meta-workflows) in an SWfMS, without changing the script's original code or embedding functions to collect provenance (Nascimento and Cruz, 2013). RFlow provides an automatic mechanism that does not require adaptations of the workflow to support the internal provenance-gathering mechanism of the R system. RFlow supports the whole life cycle of the in silico scientific experiment; provenance metadata is captured at two distinct phases. In the first phase, prospective provenance is collected at setup time of the agronomic research project in the laboratory. RFlow allows researchers to select and configure data files, parameters, meta-workflows, annotations, R scripts, publications, and keywords of the experiment through the SisGExp Web interface. In the second phase, retrospective provenance is collected during the execution of the statistical workflows on the laboratory's computational apparatus. Finally, in a third phase, both data results and provenance are delivered as open datasets; they can be browsed and queried by the scientist who executed the in silico experiment or by third parties. RFlow was planned to strike a balance between simplicity and expressivity, with the aim of empowering users who may have only a basic understanding of R script programming to package scripts as scientific workflows. As a first step, the researcher connects to the SisGExp interface to define the computational setup of the agronomic experiment, for example, registering the hypothesis; selecting existing R scripts that are part of the experiment; inserting experiment descriptors and variables of interest; choosing statistical methods; and taking annotations about the experiment. As a second step, the researcher must configure the parameters and select the input data files to work with the statistical system and the SWfMS. Such configuration is achieved by the automatic encapsulation of the chosen R script in a generic meta-workflow. The approach takes advantage of the provenance-gathering mechanism associated with the SWfMS. This step is controlled by the RFlow
Fig. 2. Overview of RFlow framework and the representation of a generic meta-workflow.
configuration services, which automatically read the script specification, associate it with a meta-workflow, and attach the provenance-capture mechanism. During these two steps, all the prospective provenance metadata related to the user, the configuration parameters, the input data, the experiment design, and other settings are captured by SisGExp and then stored in a provenance repository in the persistence layer, represented by the database system. As a third step, the researcher may run and monitor the execution of the meta-workflow hosted on a remote server. The core components of RFlow initiate the execution of the meta-workflow. The retrospective provenance metadata related to the execution of the R script is collected at run time by the provenance system coupled to the SWfMS and transferred to the provenance repository. Once the computational experiment is finished, researchers may browse, visualize, share, and analyze both the results and the provenance metadata of the experiments; the researcher will have an integrated repository of data and metadata that can be analyzed through the SisGExp interface services. RFlow is based on the model-view-controller software architectural pattern (Leff and Rayfield, 2001); the framework is essentially a loosely coupled distributed platform composed of a set of integrated software components, each of which has a well-defined purpose. Fig. 2 shows a conceptual overview of them. The main components are the meta-workflows, the Experiments Management System (SisGExp), the statistical system (R), the SWfMS (Kepler), the provenance collector service, and a relational database management system (PostgreSQL). The next subsections describe the components of the architecture.
3.1. Meta-workflows
The concept of a meta-workflow indicates a high-level abstraction that represents the workflow of the scientific experiment capable of being executed by the SWfMS (Kumar and Wainer, 2005). In this paper, a meta-workflow is a generic and reusable solution to encapsulate R scripts. It acts as a wrapper that captures the script and all its statistical functions as a sequence of activities of a workflow, allowing its implementation in an SWfMS with all the benefits of built-in provenance. A meta-workflow enables the reuse of any existing R scripts in the form of scientific workflows without having to refactor or recode them. The meta-workflows offer advantages and facilities that are absent in the R system, such as execution control, traceability, reproducibility and the collection of retrospective provenance metadata of each execution.
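As a rough R analogue of what such wrapping provides (an illustration of ours; RFlow's actual meta-workflow is a Kepler workflow with built-in provenance support, not this function), a wrapper can run an arbitrary, unmodified script while recording basic retrospective provenance around it:

# Rough analogue of a meta-workflow wrapper (illustration only; RFlow's real
# meta-workflow is a Kepler workflow with a provenance recorder, not R code).
run_wrapped <- function(script_path) {
  run <- list(script = script_path, start_time = Sys.time())
  run$status <- tryCatch({ source(script_path); "OK" },
                         error = function(e) paste("ERROR:", conditionMessage(e)))
  run$end_time <- Sys.time()
  run                                   # a minimal retrospective provenance record
}
# trace <- run_wrapped("anova_soil.R")  # "anova_soil.R" is a hypothetical script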
3.2. ExecScript

ExecScript is a workflow developed in the Kepler system (version 2.5) that realizes the meta-workflow concept: it encapsulates R scripts, offering them all the resources provided by the Kepler system. Figs. 3 and 4 illustrate the meta-workflow developed to wrap the scripts within the scope of this study. ExecScript is invoked remotely by SisGExp and enacted by the Kepler system. The workflow consists of several actors and file connectors, R-specific actors, a composite sub-workflow actor (Fig. 3), and the SDF director that orchestrates the execution of the actors. Among the actors, we highlight the “Provenance Recorder” (PR) (Altintas et al., 2006), which is used to configure and collect the retrospective provenance during the execution of the concrete workflow and store it directly in the repository (a PostgreSQL database). The PR also allows us to capture runtime errors as soon as they occur. The second most important actor is the sub-workflow “subExecScript”, modeled as a Kepler composite actor (Fig. 4). The sub-workflow is responsible for the connection between the database that stores the provenance (the provenance repository) and the SWfMS that collects retrospective provenance, and also for linking SisGExp, which collects prospective provenance.
3.3. SisGExp
The SisGExp component is a Web application that uses Java EE technology (http://www.oracle.com/technetwork/java/javaee/tech/index-jsp-142185.html); it is in charge of registering experiment planning, collecting prospective provenance, invoking the execution of the meta-workflow, and monitoring experimental data collected by the researcher in the field. The architectural style used by SisGExp is MVC (Model-View-Controller), because it allows a better separation of RFlow's layers. SisGExp uses JavaServer Faces with PrimeFaces to implement the Web interface and the workflow controls. The application server used to support all the specifications of the Java EE API is GlassFish (https://javaee.github.io/glassfish/). The main features of SisGExp are as follows: (a) registering the design of agronomic experiments and their hypotheses; (b) attaching existing R scripts and linking them to previously registered experiments; (c) generating statistical results in real time from the execution of R scripts; (d) collecting retrospective provenance metadata on executions of R scripts encapsulated by a meta-workflow; (e) registering publications (papers, reports, theses, dissertations) related to the experiments' research as OD; (f) uploading, downloading, and sharing R scripts, data files, publications, and meta-workflows as OD; and (g) collecting prospective provenance metadata of the agronomic experiment. Science requires more than trustworthy results; it requires verifiability and authentication. Thus, RFlow has three authentication roles; each role has its own functionalities and security permissions. The first role allows the system manager to control all software features (setup, managing users, managing results and metadata). Besides, this role allows the registration of previously evaluated meta-workflows and R scripts, and it also enables monitoring of the overall performance of the system and its servers. The second role allows an individual researcher to register the ownership of experiments; coordinate, execute, and reuse meta-workflows; register the protocols of the experiments; link experiments with publications; handle the input raw experimental data and the provenance metadata; and download data products (results). Besides, the researcher can query, share, and export all experiment data and its trails of provenance metadata as OD. Finally, the last role allows guest visitors such as journal editors, referees, and general users to access system resources (e.g., papers, technical reports, datasets) to browse and perform ad-hoc queries on shared experiments' data and their provenance metadata.
3.4. Provenance repository
The Provenance repository (ProvRep) is the warehouse of experiments' data and provenance metadata. The repository considers the core elements of the PROV recommendation. It provides accounts to registered users (visitors and researchers), allowing them to store and upload experiments and share results and provenance either privately or publicly in relational representation. By default, files submitted to ProvRep are private and can only be accessed by their owners. However, researchers can choose to share their files and results with others in two ways: making an OD resource (e.g., a script, workflow, or document) publicly available to any SisGExp guest visitor, or sharing it with specific registered SisGExp users. The former is useful for users who want to expose their experiments and their provenance to the public. In the latter, different access profiles can be set for authorized users for access control (e.g., system administrators, editors, referees, or readers). Except for readers, all other roles and the owner can append new annotations (prospective provenance) to a resource or dataset after it has been created. This is suitable for sharing provenance between a collaborating team of humans and applications. We use PostgreSQL, a mature, open-source, and widely used relational database management system well known for its robustness and
Fig. 3. Meta-workflow ExecScript (screenshot extracted from Kepler system).
scalability, as our relational back-end. Besides, PostgreSQL fully supports SQL queries, allowing us to access, index, and model provenance schemas following the PROV recommendation. PostgreSQL can comfortably handle the expected capacity of RFlow's data and maintain PROV-compliant schemas. The repository stores prospective and retrospective provenance metadata as well as experiments' data. Prospective provenance is generated through the interaction of the researcher with SisGExp during the phases of planning and monitoring of the experiment. Retrospective provenance is collected during the generation of the experiments' results and is automatically captured by the SWfMS.
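Since the repository is an ordinary PostgreSQL database, it can also be reached directly from R. The sketch below (the connection parameters are assumptions, not RFlow defaults) uses the DBI and RPostgres packages to list workflow runs from the public.workflow_exec table used by query Q2 in Section 4.5:

# Sketch: reading the provenance repository from R (hypothetical connection
# parameters; public.workflow_exec belongs to Kepler's public schema).
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "rflow",
                 host = "localhost", user = "researcher", password = "...")
runs <- dbGetQuery(con, "SELECT id, start_time, end_time FROM public.workflow_exec")
dbDisconnect(con)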
3.5. Provenance schemas
Relational modeling is mature and well understood. Thus, we modeled entity-relationship diagrams for RFlow and translated them into relational database schemas to store the prospective and retrospective provenance and the datasets of the agronomic experiments. The provenance repository of RFlow is straightforward, using two shared schemas, public and expdata, for storing and querying the PROV-compliant provenance metadata. The public schema stores the retrospective provenance collected by the Kepler system. In this schema, we can identify a series of tables that represent the workflow structure. The schema consists of 17 tables, where data and retrospective metadata from the execution of the
Fig. 4. Sub-workflow of ExecScript, depicted by subExecScript (screenshot extracted from Kepler system).
Fig. 5. Conceptual view of the RFlow provenance expdata schema enriched with domain-specific data. Noncritical class fields and tables are not shown.
scientific workflow are stored. These tables are created during the installation and configuration of the Provenance Recorder actor. The expdata schema (Fig. 5) is associated with prospective provenance; it was planned and implemented to meet the demands of agronomic experiments. It consists of 11 tables compliant with the three core elements of the PROV recommendation and fully compatible with the public schema. It allows researchers to catalog the rationale of the experiment (design, variables, parameters, methods, scripts, users, publications, executions, and annotations), which otherwise could not be captured either by R scripts or by any SWfMS. The complete description of the schemas, with all tables and detailed field descriptions, is available at http://labbd.ufrrj.br/RFlow/COMPAG.rar. In the expdata schema, we can identify a set of tables that represent the structure of the in silico experiment. The tables and their classification according to the PROV recommendation are indicated in parentheses. The table “Experiment” (PROV:entity) stores metadata descriptors about the agronomic experiment; the tables “Experimental_Design” (PROV:entity) and “Variable” (PROV:entity) store metadata about the design of the experimental activities and the parameters of the meta-workflow, respectively. The table “Script” (PROV:activity) stores the R scripts; the table “wf_script” (PROV:activity) stores the scripts encapsulated by the meta-workflows; and the table “Statistical_Method” describes the statistical analysis used in the agronomic experiment. The table “Data” (PROV:entity) stores domain-specific research data. The table “User” (PROV:agent) stores metadata about SisGExp users and their profiles. Finally, the table “Scientific_paper” (PROV:entity) stores metadata about the publications or datasets associated with the agronomic experiment. The table Public:Workflow_Exec (shown in blue in Fig. 5) belongs to the public schema and is used to link both schemas.
4. Case study: Relating provenance and agronomic experimental data
In this section, we present RFlow in a real scenario: a case study with three computational in silico experiments based on the official Brazilian Agricultural Research Corporation (Embrapa). Embrapa's mission is to develop a genuinely Brazilian model of tropical agriculture and livestock to overcome the barriers that limited the production of food, fiber, and fuels (EMBRAPA, 2017).
4.1. Experimental scenario and scripts

Several laboratories at Embrapa face challenges in the design of their agronomic experiments. First, researchers often work in isolated computing environments where they put together a patchwork of ad-hoc scripts coded in multiple programming languages, interfacing with a mix of third-party libraries and executables from disparate sources. Second, the documentation and integration of scripts containing thousands of lines of code (ranging from one-off throwaway tasks to complex combinations of packages) are outside the scope of agronomic project goals. Third, researchers have difficulty generating a compendium that encompasses all the input data, scripts, computing environments, executables, and parameters needed to properly reproduce their agronomic experiments. However, like any other researchers, their scientific results are often loosely described as tables, plots, and figure captions included in peer-reviewed scientific publications. According to Chambers (2008), an experiment based on an R script is merely a text file containing the same commands that one would enter on the command line of the R environment. The author organized R scripts into three categories: (i) Category I – the simplest: the script uses internal parameters or datasets arranged inside its own code; (ii) Category II – the script does not contain the dataset; it has functions to read external data locally stored as CSV files or spreadsheets; (iii) Category III – the script does not contain the dataset; it has functions to read external data remotely stored on the Web, remote servers, or database servers (the sketch below illustrates the reading patterns of categories II and III).
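A minimal sketch of the two external reading patterns (the file name and URL are hypothetical):

# Category II: external data read from a local file (hypothetical file name).
local_data  <- read.csv("plots_2015.csv")
# Category III: external data read from a remote source (hypothetical URL).
remote_data <- read.csv("https://example.org/open-data/plots_2015.csv")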
Nascimento (2015) did a systematic review of about 110 R scripts used at Embrapa Agrobiology. The scripts were developed over five years (2010–2015); he categorized each one according to Chambers' classification. The author estimates that about 90% of the scripts can be classified as category II and 10% as category I. He could
not find any script of category III. To evaluate the provenance metadata gathering approach and query the provenance metadata of the experiments with as little experimenter bias as possible, we developed three computational experiments with statistical workflows, one compatible with each category. Furthermore, to avoid conflicts of interest, we selected distinct samples of R scripts (R version 2.10.1) to execute these experiments; the scripts are either property of Embrapa Agrobiology or external R scripts supplied by public-domain experiments. Finally, each R script belongs to a distinct research domain. Our experimental setup is as follows. As our test platform, we used a 2-node cluster (Intel Xeon 2.4 GHz cores, 8 GB RAM per node) connected via 10 Gbps Ethernet to 100 TB of storage, running the Kepler SWfMS (version 2.5), PostgreSQL (version 9.2), and Red Hat Linux (Enterprise Edition, version 6.6), plus a notebook (Intel Core i7, 4 GB RAM) to remotely access RFlow and the SisGExp Web interface. The goal of our experimental evaluation is to investigate the usage of RFlow and examine whether the provenance of scripts wrapped in meta-workflows can be generated, stored in the database, and further queried using SQL. Besides, we stress that although the statistical analysis algorithms used in the in silico experiments are relatively simple, they show how the researcher can take advantage of the integration of provenance metadata with the scientific data generated by the experiments. The following subsections describe and analyze the results of the three experiments.
4.2. Category I – Neural network experiment

The first computational experiment consists of the encapsulation of the simplest category of R scripts by the ExecScript meta-workflow. We used internal datasets arranged inside the code of the script. The script uses the ‘neuralnet’ R library to train and build a multilayer perceptron neural network. It takes a single input (a number to be square-rooted) and produces an output (the square root of the input). The network contains ten hidden neurons, which are trained; the outputs of the meta-workflow look like Fig. 6. The neural network finds the square root; the prevalent error, in finding the square root of 1, is out by ∼4%.

4.3. Category II – Agrobiology experiment

The second computational experiment consists of the encapsulation of a proprietary R script that belongs to category II, the category most used at Embrapa Agrobiology. The script is wrapped by the ExecScript meta-workflow. The script reads external data locally stored as spreadsheets,
executes calculations, and generates a set of graphical outputs. To illustrate how we can leverage integrated prospective and retrospective provenance, we use an R script that calculates the ANOVA statistics of a soil fertility experiment (Fig. 7). The script reads one data file which contains experimental data from soil experiments conducted by Embrapa Agrobiology. The script has three activities: it calculates the mean values of several variables, creates plots of their interactions, and calculates the ANOVA. The outputs of the meta-workflow look like the graphics depicted in Fig. 8. The experiment evaluates the percentage of nitrogen in the soil. It uses a completely randomized design and considers four treatments, four repetitions, and three samples. The upper graphs (Fig. 8a and b) show boxplots of the values of the residues of sediments by treatment and by material, respectively. Fig. 8c illustrates the dispersion of the residuals of the model against the data predicted by the model. Fig. 8d shows the boxplot of the standardized residues. Finally, Fig. 8e displays the normal probability against the residues obtained between the experimental data and the values predicted by the empirical statistical model.

4.4. Using RFlow to browse data and metadata of the in silico experiments

RFlow was developed to use the provenance schemas described in the previous sections. By executing the three computational experiments with encapsulated R scripts, we were able to register the prospective provenance of each experiment and also collect a variety of retrospective provenance metadata about its results. For example, Fig. 9 is a screenshot of SisGExp in which we highlight two dotted-line boxes. The upper box indicates the menu bar, and the lower box shows a grid panel. It presents several data and provenance metadata about the three experiments mentioned in the previous subsections. The menu bar contains drop-down menus adjusted to the security role of the logged-in user. Its purpose is to supply specific menus which provide access to functions such as registering an agronomic experiment and the protocol used in it; uploading original raw data and R scripts; reproducing previously registered in silico experiments; querying the provenance of results; and downloading the scripts and results. In the grid, a registered user can browse each experiment and see annotations (prospective provenance metadata) about them. For instance, the first line of the grid panel presents the metadata of an agronomic experiment of category II. In this case, a user can browse the metadata of the agronomic experiment (name of the experiment, author, dataset used, and when and where the original agronomic raw data were collected).
Input    Expected Output    Neural Net Output
1        1                  0.9984154698
4        2                  2.0016716739
9        3                  2.9982157822
16       4                  3.9987940986
25       5                  4.9949359964
36       6                  6.0078252585

Fig. 6. Example of the output data of the experiment of the first category generated by RFlow.
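For reference, a minimal category I script along these lines might look as follows. This is our own reconstruction under the description in Section 4.2 (not Embrapa's script); the training set is embedded in the code, as category I requires.

# Illustrative category I script (our reconstruction, not the original):
# a multilayer perceptron with ten hidden neurons learning the square root.
library(neuralnet)
train <- data.frame(x = (1:10)^2, y = 1:10)   # squares and their roots, inline data
set.seed(42)                                  # assumed seed, for repeatability
nn <- neuralnet(y ~ x, data = train, hidden = 10)
test <- data.frame(x = c(1, 4, 9, 16, 25, 36))
compute(nn, test)$net.result                  # outputs comparable to Fig. 6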
arquivo = read.table("datSoil10-03-2015.xlt", h=T)     # Activity 1 – importing the spreadsheet file
arquivo                                                # Activity 1 – reading the file in R
dim(arquivo)
names(arquivo)                                         # Activity 1 – naming columns
attach(arquivo)                                        # Activity 1 – attaching the file
is.factor(Trata)                                       # Activity 2 – factoring
is.factor(Mat)
is.numeric(PorcN)                                      # Activity 2 – numerical variable
arquivo.m <- tapply(PorcN, list(Trata, Mat), mean)     # Activity 2 – calculating interaction means
arquivo.m                                              # Activity 2 – releasing interaction means
arquivo.mt <- tapply(PorcN, Trata, mean)               # Activity 2 – calculating general means
arquivo.mt
arquivo.mm <- tapply(PorcN, Mat, mean)
arquivo.mm                                             # Activity 2 – releasing general means
par(mfrow=c(1,2))                                      # Activity 3 – two-panel graphics layout
interaction.plot(Trata, Mat, PorcN)                    # Activity 3 – 2D interaction plot
interaction.plot(Mat, Trata, PorcN)                    # Activity 3 – 2D interaction plot
arquivo.av <- aov(PorcN ~ Trata + Mat + Trata * Mat)   # Activity 3 – calculating the ANOVA (factorial CRD)
arquivo.av <- aov(PorcN ~ Trata * Mat)
summary(arquivo.av)                                    # Activity 3 – releasing the ANOVA

Fig. 7. An example of a category II script – ANOVA statistics.
The last column of the grid panel shows the SisGExp commands “list” and “alter”. If one selects the “list” command, the grid panel loads all the retrospective provenance metadata collected by RFlow (Fig. 10), showing the name of each experiment, its owner, the workflow executor, the scripts and files used, the status of the execution, and the timestamp of the experiment.
4.5. Quantifying and qualifying experiments' data and their provenance metadata

Our case study shows how the researcher can exploit the integration of provenance metadata and domain data to execute queries that otherwise could not be performed. In other words, this subsection shows the expressive power of our approach, demonstrating that it supports a wide variety of questions that typically arise when researchers are setting up and conducting agronomic experiments. For instance, the researcher can query the data generated by the experiments and their metadata. Besides that, the approach supports the full life cycle of in silico experiments described in Section 2. In this subsection, we present a set of three SQL queries (Q1, Q2, and Q3) that show how researchers can assess the advantages of using RFlow.
4.5. Quantifying and qualifying the experiments' data and its provenance metadata

Our case study shows how the researcher can exploit the integration of provenance metadata and domain data to execute queries that otherwise could not be performed. In other words, this subsection shows the expressive power of our approach, demonstrating that it supports a wide variety of questions that typically arise when researchers are setting up and conducting agronomic experiments. For instance, the researcher can query the data generated by the experiments and its metadata. Besides that, the approach supports the full lifecycle of in silico experiments described in Section 2. In this subsection, we present a set of three SQL queries (Q1, Q2, and Q3) that show how researchers can qualify the advantages of using RFlow.
4.5.1. Querying prospective provenance

The first query (Q1) is used to quantify, select, and extract prospective metadata captured by SisGExp and stored in the expdata provenance schema; such metadata was collected during the first phase of the life cycle of the experiment described in Section 2.2.
Fig. 8. Graphical outputs of an experiment of category II generated by RFlow.
This kind of metadata is used to determine which in silico agronomic experiments registered in RFlow during the year 2017 produced scientific results that appeared in scientific publications (e.g., journals). Q1 must return the name of the researcher who conducted the experiment in the field, the goal of the agronomic experiment, its duration, and the publications related to it. In our provenance repository, Q1 is modeled as the following SQL statement:

SELECT usu.nome, exp.nome, exp.dt_inicial_instalacao,
       exp.dt_final_instalacao, trab.titulo
FROM expdados.experimento exp, expdados.usuario usu,
     expdados.trabalho_cientifico trab
WHERE (exp.id_responsavel = usu.id)
  AND (exp.id = trab.id_experimento)
  AND exp.dt_inicial_instalacao >= '2017-01-01 00:00:00'
  AND exp.dt_inicial_instalacao <= '2017-10-30 00:00:00';
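As a usage note, such a statement can be issued from any SQL client. Assuming the provenance repository is a PostgreSQL database (the schema-qualified table names suggest this, but the connection details below are hypothetical), a researcher could also run Q1 directly from R:

# Illustrative sketch only: running Q1 from R through DBI;
# dbname, host, and user are hypothetical connection details.
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "rflow",
                 host = "localhost", user = "researcher")
q1 <- "SELECT usu.nome, exp.nome, exp.dt_inicial_instalacao,
              exp.dt_final_instalacao, trab.titulo
       FROM expdados.experimento exp, expdados.usuario usu,
            expdados.trabalho_cientifico trab
       WHERE exp.id_responsavel = usu.id
         AND exp.id = trab.id_experimento
         AND exp.dt_inicial_instalacao >= '2017-01-01 00:00:00'
         AND exp.dt_inicial_instalacao <= '2017-10-30 00:00:00'"
result <- dbGetQuery(con, q1)
dbDisconnect(con)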
Fig. 9. Screenshot of the list of registered agronomic experiments in SisGExp.
Fig. 10. Screenshot with experimental results of each implementation of the R scripts in SisGExp.
4.5.2. Querying retrospective provenance

The second query (Q2) aims to quantify, select, and return retrospective metadata captured by the Provenance Recorder and stored in the public provenance schema; such metadata was collected during the second phase of the life cycle of the experiment described in Section 2.2. Q2 is used to determine which statistical meta-workflows (R scripts) were executed by whom during a given time, showing which annotations (about the workflow executions) were inserted by the executor, as well as the final status of each workflow run and its error messages, if present. Query Q2 can be performed only after the executions of the computational experiments. It is important to highlight that in Q2 we have to consider all the executions of the three experiments that were encapsulated by the ExecScript meta-workflow, whether or not they presented execution errors (based on provenance metadata). In our provenance repository, Q2 is modeled as the following SQL statement:
SELECT w."name", exe."user", exe.annotation, exe."type",
       exe.start_time, exe.end_time, e.message
FROM public.workflow w,
     public.workflow_exec exe
     FULL JOIN public.error e ON exe.id = e.exec_id
WHERE w.id = exe.wf_id
  AND exe.start_time >= '2017-01-01 00:00:00'
  AND exe.end_time <= '2017-10-30 00:00:00';
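Note the FULL JOIN between public.workflow_exec and public.error: workflow runs that produced no error record are still returned, with a NULL message, which is what lets Q2 list successful and failed executions alike.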
4.5.3. Querying prospective and retrospective provenance

The third query (Q3) aims to quantify, select, and extract both kinds of provenance (prospective and retrospective metadata), captured by SisGExp and the Provenance Recorder and stored in the expdata and public provenance schemas, respectively. This kind of metadata is collected during the three phases of the life cycle of the experiment described in Section 2.2. Q3 is used to determine which researcher conducted an agronomic experiment that used a script named ANOVA and produced annotations, and which required meta-workflows to wrap existing R scripts during a given time. Query Q3 can be performed only after the registration and configuration of the agronomic experiment and after each execution of the computational experiment. It is significant to highlight that in Q3 we must consider all the executions of all experiments that were encapsulated by the ExecScript meta-workflow. In our provenance repository, Q3 is modeled as the following SQL statement:

SELECT usu.nome, exp.nome, exe."type", exe.annotation,
       exe.start_time, exe.end_time, e.message
FROM public.workflow w, expdados.wf_script wf, expdados.script sc,
     expdados.experimento exp, expdados.usuario usu,
     public.workflow_exec exe FULL JOIN public.error e ON (exe.id = e.exec_id)
WHERE (exe.id = wf.id_workflow_exec)
  AND (wf.id_script = sc.id)
  AND (sc.id_experimento = exp.id)
  AND (exp.id_responsavel = usu.id)
  AND exe.start_time >= '2017-01-01 00:00:00'
  AND exe.end_time <= '2017-10-30 00:00:00'
  AND sc.nome = 'ANOVA';
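Unlike Q1 and Q2, Q3 joins tables from both the expdados and public schemas; this cross-schema join is what ties a workflow execution (retrospective provenance) back to the registered experiment, its script, and the responsible researcher (prospective provenance).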
5. Related work

Currently, the state of the art investigates the transparent collection of provenance metadata of workflows. For instance, various tools have been proposed to capture the retrospective provenance of R scripts. However, these studies present solutions in the opposite direction of our proposal.

Several alternatives that aim to collect provenance of R scripts have appeared in recent years. Some authors advocate incorporating provenance gathering into statistical systems. For instance, Silles and Runnalls (2010) and Runnalls (2013) propose a refactoring of the R interpreter to include provenance facilities in the R execution engine. They presented a variant of R named CXXR. The system offers some features, such as execution log files, to ease the collection of retrospective provenance. However, as far as we are concerned, it does not provide several facilities that exist in a current SWfMS. Furthermore, their approach also requires changes in the source code of the scripts, embedding specific commands to track retrospective provenance only.

At this time, several research proposals in the scientific literature investigate novel ways to capture provenance metadata of scripting languages other than R. For example, Bochner et al. (2008) proposed an API and a library to capture and query the retrospective provenance metadata of Python scripts. ProvenanceCurious is another tool that can infer data provenance from Python scripts; it uses abstract syntax tree (AST) analysis and a graph to provide query capabilities (Huq et al., 2013). Both works are unable to reuse R scripts. Murta et al. (2015) presented the noWorkflow tool, which also uses AST analysis and can collect the retrospective provenance of Python scripts transparently, without requiring modifications to the script. It uses Python runtime profiling functions to generate provenance traces that reflect the processing history of a given script. noWorkflow stores the provenance in a structured database. However, to the best of our knowledge, the tool captures and stores retrospective provenance only. Besides, the research gives no hints as to whether the database schema is compatible with provenance standards like W3C PROV or ProvONE (Cuevas-Vicenttín et al., 2014), nor whether it is capable of executing R scripts.
A new tool called YesWorkflow was recently presented by McPhillips et al. (2015). It aims to provide researchers who use scripting languages with some of the benefits of scientific workflow systems. YesWorkflow does not require the use of a workflow engine, but it does require adjustments to the source code of the script: researchers add keyword-based annotations, as special comments in existing scripts coded in Perl, Python, R, or MATLAB, that reveal the computational modules and data flows. According to the authors, YesWorkflow complements the noWorkflow tool by revealing the prospective provenance of the scripts. As with noWorkflow, no hints were provided as to whether the database schema is compatible with provenance standards.
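As an illustration of this annotation style, the fragment below adds YesWorkflow-style comments to the first activity of the Fig. 7 script; the block and port names are our own choice, not taken from the original publication.

# Illustrative only: YesWorkflow keyword annotations marking one
# computational module and its data flows in an existing R script.
# @begin ImportSpreadsheet
# @in soil_spreadsheet
# @out arquivo
arquivo = read.table("datSoil10-03-2015.xlt", h=T)
# @end ImportSpreadsheet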
Taverna is a traditional workflow engine used in the biological sciences which supports R scripts through a plugin named RShell (Wassink et al., 2009). To support scripts, the plugin directly uses the R interpreter; it consists of a Taverna processor for R scripts and an RShell Session Manager that communicates with the R server. However, we stress that the plugin cannot collect the prospective provenance of the workflows the engine executes.

RFlow is distinguished from related work because it allows peer validation of the data collected from field experiments by reproducing the R scripts that generate statistical results while collecting two types of provenance metadata. It allows researcher interaction through the Internet using a Web platform, hiding operational details about SWfMS configuration and the technicalities of harvesting provenance metadata, and it shares the data as OD through the SisGExp interface.

6. Conclusion

Good science requires proper documentation, curation, and reproducibility, but the effort to achieve this has been significant. To reduce this gap, we have developed RFlow, an approach that empowers researchers to reproduce, reuse, and encapsulate existing R scripts as statistical scientific workflows. It collects and stores agronomic experiments' data together with different kinds of provenance metadata, making them useful and publicly accessible as OD to other researchers or third parties. To the best of our knowledge, few works face the challenges of gathering provenance from the whole life cycle of in silico agronomic experiments. Compared to related works, the main advantages of RFlow are: (a) it is entirely transparent: R scripts are fully preserved, and researchers and third-party users do not need to instrument or change their code; (b) it systematically captures two types of provenance (prospective and retrospective) of the in silico experiments; (c) it does not require researchers to change their manner of working, since scripts are wrapped by generic meta-workflows that can be enacted in a SWfMS and controlled via SisGExp; (d) it enhances the computational reproducibility of agronomic experiments, because the proposed database schemas can register both raw data and metadata that can be preserved (after the end of the experiment), reused, shared, and browsed later by peers; and (e) it enables querying provenance efficiently, based on consolidated database and provenance standards.

Although our evaluation has shown that RFlow is adequate for a range of experiments in different areas of agriculture and that its overhead is not burdensome, there are some known limitations. For instance, according to the Research Data Alliance (https://www.rd-alliance.org/), there is great concern about the lack of standards for how to collect and register agronomic data. For such situations, our current approach is to adapt the expdata schema to future standards. There is still much computational work to be developed in the agricultural sciences and more in-depth studies to understand the role of data provenance in agriculture. As future work, we plan to incorporate the FAIR data principles (Wilkinson et al., 2016) for scientific data management to improve automatic data discoverability, data sharing, data curation, and long-term data stewardship policies. Besides, we plan to evaluate the use of the OpenSoils database (Cruz et al., 2018b) in our framework.
Acknowledgments

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. This work was partially sponsored by the Brazilian National Fund to Develop Education (FNDE), the Educational Tutorial Program (PET-SI/UFRRJ), EMBRAPA Agrobiology, and the Carlos Chagas Filho Research Foundation (FAPERJ). We also thank the CYTED networks BigDSSAgro and SmartLogistics@IB.

Appendix A. Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.compag.2019.01.044.

References

AIMS, 2017. Agricultural Information Management Standards.
Aldeco-Pérez, R., Moreau, L., 2010. A provenance-based compliance framework. In: FIS'10, Conference on Future Internet, pp. 128–137. https://doi.org/10.1007/978-3-642-15877-3_14.
Altintas, I., Barney, O., Jaeger-Frank, E., 2006. Provenance collection support in the Kepler scientific workflow system. Lect. Notes Comput. Sci. 4145, 118–132. https://doi.org/10.1007/11890850_14.
Baker, M., 2016. 1,500 scientists lift the lid on reproducibility. Nature 533 (7604), 452–454. https://doi.org/10.1038/533452a.
Bochner, C., Gude, R., Schreiber, A., 2008. A Python library for provenance recording and querying. Lect. Notes Comput. Sci. 5272, 229–240. https://doi.org/10.1007/978-3-540-89965-5_24.
Borgman, C.L., 2012. The conundrum of sharing research data. Adv. Inform. Sci. 63 (6), 1059–1078. https://doi.org/10.1002/asi.22634.
Bowers, S., McPhillips, T.M., Ludäscher, B., 2008. Provenance in collection-oriented scientific workflows. Concurr. Comput.: Pract. Exper. 20 (5), 519–529.
Brooks, H., Sugden, A., Alberts, B., 2011. Making data maximally available. Science 331 (6018), 64. https://doi.org/10.1126/science.1203354.
Buneman, P., Khanna, S., Tan, W.-C., 2000. Data provenance: some basic issues. Lect. Notes Comput. Sci. 1974, 87–93. https://doi.org/10.1007/3-540-44450-5_6.
Buneman, P., Davidson, S., Frew, J., 2016. Why data citation is a computational problem. Commun. ACM 59 (9), 50–57. https://doi.org/10.1145/2893181.
Callahan, S.P., Freire, J., Santos, E., Scheidegger, C.E., Silva, C.T., Vo, H.T., 2006. VisTrails: visualization meets data management. In: Proceedings of SIGMOD'06, pp. 745–747. https://doi.org/10.1145/1142473.1142574.
CGIAR, 2017. Science for a food-secure future. http://www.cgiar.org.
Chambers, J., 2008. Software for Data Analysis, 1. Springer, Berlin/Heidelberg.
Cho, Y., Moon, J., Yoe, H., 2010. A context-aware service model based on workflows for u-agriculture. In: Proceedings of ICCSA'10, pp. 258–268. https://doi.org/10.1007/978-3-642-12179-1_23.
CODATA, 2017. Agriculture Data, Knowledge for Learning and Innovation. http://www.codata.org/task-groups/agriculture-data.
Crawford, M., Sonogan, R., Unkovich, M., Yunusa, I., 2003. Review of Long-term Agronomic Experiments, 1. Department of Natural Resources & Environment, Victoria, Australia.
Cruz, S.M.S., Campos, M.L.M., Mattoso, M., 2009. Towards a taxonomy of provenance in scientific workflow management systems. In: Proceedings of the Congress on Services'09, pp. 259–266. https://doi.org/10.1109/services-i.2009.18.
Cruz, S.M.S., Costa, R.J.M., Manhães, M., Zavaleta, J., 2013. Monitoring SOA-based applications with business provenance. In: SAC'13, Symposium on Applied Computing, pp. 1927–1932. https://doi.org/10.1145/2480362.2480718.
Cruz, S.M.S., Nascimento, J.A.P., 2016. SisGExp: rethinking long-tail agronomic experiments. In: Mattoso, M., Glavic, B. (Eds.), Provenance and Annotation of Data and Processes: Proceedings of the 6th International Provenance and Annotation Workshop, IPAW 2016, McLean, VA, USA. Springer, Berlin.
Cruz, S.M.S., Ceddia, M.B., Miranda, R.C.T., Rizzo, G.S., Klinger, F., Cerceau, R., Mesquita, R., Cerceau, R., Marinho, E.C., Schmitz, E.A., Sigette, E., Cruz, P.V., 2018a. Data provenance in agriculture. In: Alper, P., Belhajjame, K. (Eds.), Provenance and Annotation of Data and Processes: Proceedings of the 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK. Springer, Berlin. https://doi.org/10.1007/978-3-319-98379-0_31.
Cruz, S.M.S., Ceddia, M.B., Cruz, P.V.C., et al., 2018b. Towards an e-infrastructure for Open Science in Soils Security. In: XII Brazilian e-Science Workshop, XXXVIII Congresso da Sociedade Brasileira de Computação, Natal, Brazil.
Cuevas-Vicenttín, V., Kianmajd, P., Ludäscher, B., Missier, P., Chirigati, F., Wei, Y., Dey, S., 2014. The PBase scientific workflow provenance repository. Int. J. Digit. Curat. 9 (2), 28–38. https://doi.org/10.2218/ijdc.v9i2.332.
Davidson, S.B., Freire, J., 2008. Provenance and scientific workflows: challenges and opportunities. In: Proceedings of SIGMOD'08, pp. 1345–1350. https://doi.org/10.1145/1376616.1376772.
Deelman, E., Gannon, D., Shields, M., Taylor, I., 2009. Workflows and e-Science: an overview of workflow system features and capabilities. Future Gener. Comput. Syst. 25 (5), 528–540. https://doi.org/10.1016/j.future.2008.06.012.
Dou, L., Cao, G., Morris, P.J., Morris, R.A., Ludäscher, B., Macklin, J.A., Hanken, J., 2012. Kurator: a Kepler package for data curation workflows. Procedia Comput. Sci. 9, 1614–1619. https://doi.org/10.1016/j.procs.2012.04.177.
Driemeier, C.E., Ling, L.Y., Pontes, A.O., Sanches, G.M., Franco, H.C.J., Magalhães, P.S.G., Ferreira, J.E., 2014. Data analysis workflow for experiments in sugarcane precision agriculture. In: 10th International Conference on eScience'14, pp. 163–168. https://doi.org/10.1109/eScience.2014.10.
EMBRAPA, 2017. Brazilian Agricultural Research Corporation. https://www.embrapa.br/en.
Fomel, S., Claerbout, J.F., 2009. Reproducible research. Comput. Sci. Eng. 11 (5). https://doi.org/10.1109/MCSE.2009.14.
Freire, J., Koop, D., Santos, E., Silva, C.T., 2008. Provenance for computational tasks: a survey. Comput. Sci. Eng. 10, 11–21. https://doi.org/10.1109/MCSE.2008.79.
Gabriel, A., Capone, R., 2011. Executable paper grand challenge workshop. Procedia Comput. Sci. 4, 577–578. https://doi.org/10.1016/j.procs.2011.04.060.
Gandrud, C., 2015. Reproducible Research with R and R Studio, 2. Chapman and Hall/CRC, USA.
GODAN, 2017. Global Open Data for Agriculture and Nutrition. http://www.godan.info/.
Godfray, H.C., Beddington, J.R., Crute, I.R., Haddad, L., Lawrence, D., Muir, J.F., 2010. Food security: the challenge of feeding 9 billion people. Science 327 (5967), 812–818. https://doi.org/10.1126/science.1185383.
Hartig, O., Zhao, J., 2010. Publishing and consuming provenance metadata on the Web of Linked Data. Lect. Notes Comput. Sci. 6873, 78–90. https://doi.org/10.1007/978-3-642-17819-1_10.
Heidorn, P.B., 2008. Shedding light on the dark data in the long tail of science. Library Trends 5 (2), 280–299.
Hey, T., Tansley, S., Tolle, K., 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery, 1. Microsoft Research, USA.
Huq, M.R., Apers, P.M.G., Wombacher, A., 2013. ProvenanceCurious: a tool to infer data provenance from scripts. In: Proceedings of EDBT'13, pp. 765–768. https://doi.org/10.1145/2452376.2452475.
Ioannidis, J.P., 2005. Why most published research findings are false. PLoS Med. 2 (8), e124. https://doi.org/10.1371/journal.pmed.0020124.
ISO, International Organization for Standardization, 2018. ISO 8000-120:2016. https://www.iso.org/standard/62393.html.
Kashlev, A., Lu, S., 2014. A system architecture for running big data workflows in the cloud. In: IEEE International Conference on SCC'14, pp. 51–58. https://doi.org/10.1109/SCC.2014.16.
Katz, D.S., Zhang, Z., 2014. Special issue on eScience infrastructure and applications. Future Gener. Comput. Syst. 6, 335–337. https://doi.org/10.1016/j.future.2014.03.007.
Kohler, R.E., 2002. Landscapes and Labscapes, 1. University of Chicago Press, Chicago.
Kumar, A., Wainer, J., 2005. Meta workflows as a control and coordination mechanism for exception handling in workflow systems. Decis. Support Syst. 40 (1), 89–105. https://doi.org/10.1016/j.dss.2004.04.006.
Leff, A., Rayfield, J.T., 2001. Web-application development using the Model/View/Controller design pattern. In: IEEE Enterprise Distributed Object Computing Conference, pp. 118–127. https://doi.org/10.1109/EDOC.2001.950428.
Lerner, B., Boose, E., 2015. RDataTracker and DDG Explorer. In: Proceedings of TaPP'15, pp. 288–290. https://doi.org/10.1007/978-3-319-16462-5_36.
LeVeque, R.J., Mitchell, I.M., Stodden, V., 2012. Reproducible research for scientific computing: tools and strategies for changing the culture. Comput. Sci. Eng. 14 (4), 13–17. https://doi.org/10.1109/MCSE.2012.38.
Liu, J., Pacitti, E., Valduriez, P., Mattoso, M., 2014. A survey of data-intensive scientific workflow management. J. Grid Comput. 13 (4), 457–493.
Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Zhao, Y., 2006. Scientific workflow management and the Kepler system: research articles. Concurr. Comput.: Pract. Exper. 18 (10), 1039–1065.
Maat, H., 2011. The history and future of agricultural experiments. NJAS Wageningen J. Life Sci. 57 (3–4), 187–195. https://doi.org/10.1016/j.njas.2010.11.001.
Martin, A., Lyle, J., Namiluko, C., 2012. Provenance as a security control. In: 4th USENIX Conference on Theory and Practice of Provenance, TaPP'12, pp. 1–4.
Mattoso, M., Werner, C., Travassos, G., Braganholo, V., Ogasawara, E., Oliveira, D., Cruz, S.M.S., Martinho, W., Murta, L., 2010. Towards supporting the life cycle of large scale scientific experiments. Int. J. Bus. Process Integr. Manage. 5, 79–82.
McPhillips, T., Bowers, S., Zinn, D., Ludäscher, B., 2009. Scientific workflow design for mere mortals. Future Gener. Comput. Syst. 25 (5), 541–551. https://doi.org/10.1016/j.future.2008.06.013.
McPhillips, T., Song, T., Kolisnik, T., 2015. YesWorkflow: a user-oriented, language-independent tool for recovering workflow information from scripts. Int. J. Digit. Curat. 10 (1), 298–313. https://doi.org/10.2218/ijdc.v10i1.370.
Mervis, J., 2016. NSF breaks new ground in reprimanding authors of flawed Science paper. Science. https://doi.org/10.1126/science.aae0313.
Mesirov, J.P., 2010. Accessible reproducible research. Science 327 (5964), 415–416. https://doi.org/10.1126/science.1179653.
Michener, W.K., 2015. Ecological data sharing. Ecol. Inform. 29 (Part 1), 33–44. https://doi.org/10.1016/j.ecoinf.2015.06.010.
Morales, A., Robles, T., Alcarria, R., Cedeño, E., 2013. On the support of scientific workflows over pub/sub brokers. Sensors 13 (8), 10954–10980. https://doi.org/10.3390/s130810954.
Moreau, L., Groth, P., 2013. Provenance: An Introduction to PROV, 1. Morgan & Claypool, USA.
Mullis, T., Liu, M., Kalyanaraman, A., Vaughan, J., Tague, C., Adam, J., 2014. Design and implementation of Kepler workflows for BioEarth. Procedia Comput. Sci. 29, 1722–1732. https://doi.org/10.1016/j.procs.2014.05.157.
Murta, L., Braganholo, V., Chirigati, F., Koop, D., Freire, J., 2015. noWorkflow: capturing and analyzing provenance of scripts. In: Proceedings of IPAW'14, pp. 71–83. https://doi.org/10.1007/978-3-319-16462-5_6.
Nascimento, J.A.P., 2015. RFLOW: uma arquitetura para execução e coleta de proveniência de workflows estatísticos. Master thesis, UFRRJ, Seropédica-RJ, Brazil (in Portuguese with abstract in English). https://www.embrapa.br/agrobiologia/busca-de-publicacoes/-/publicacao/1028905/rflow-uma-arquitetura-para-execucao-e-coleta-de-proveniencia-de-workflows-estatisticos.
Nascimento, J.A.P., Cruz, S.M.S., 2013. RFlow: uma abordagem de reutilização de workflows estatísticos legados. In: XII Brazilian e-Science Workshop, Maceió, 8 pp. (in Portuguese with abstract in English).
NSF, National Science Foundation, 2017. Dissemination and Sharing of Research Results. https://www.nsf.gov/bfa/dias/policy/dmp.jsp.
Oinn, T., Li, P., Kell, D.B., Goble, C., Goderis, A., Greenwood, M., Zhao, J., 2007. Taverna/myGrid: aligning a workflow system with the life sciences community. In: Workflows for e-Science, pp. 300–319. https://doi.org/10.1007/978-1-84628-757-2_19.
OKFN, Open Knowledge Foundation, 2017. What is open? https://okfn.org/opendata/.
Open Definition, 2017. Open Definition 2.1. http://opendefinition.org/od/2.1/en/.
Parolini, G.J., 2015. The emergence of modern statistics in agricultural science: analysis of variance, experimental design and the reshaping of research at Rothamsted Experimental Station. J. Hist. Biol. 48 (2), 301–335. https://doi.org/10.1007/s10739-014-9394-z.
Pasquier, T., Lau, M.K., Trisovic, A., Boose, E.R., Couturier, B., Crosas, M., Ellison, A.M., Gibson, V., Jones, C.R., Seltzer, M., 2017. If these data could talk. Sci. Data 4, 170114. https://doi.org/10.1038/sdata.2017.114.
Prabhu, P., Jablin, T.B., Raman, A., Zhang, Y., Huang, J., Kim, H., August, D.I., 2011. A survey of the practice of computational science. In: Proceedings of SC'11. https://doi.org/10.1145/2063348.2063374.
ProvChallenge, 2010. Provenance Challenge Wiki. http://twiki.ipaw.info/bin/view/Challenge/WebHome.
Runnalls, A., 2013. CXXR: an extensible R interpreter. Comput. Stat. 5 (3), 181–189. https://doi.org/10.1002/wics.1251.
Sandve, G.K., Nekrutenko, A., Taylor, J., Hovig, E., 2013. Ten simple rules for reproducible computational research. PLoS Comput. Biol. 9 (10), e1003285. https://doi.org/10.1371/journal.pcbi.1003285.
Silles, C., Runnalls, A., 2010. Provenance-awareness in R. In: Proceedings of IPAW'10, LNCS 6378, pp. 64–72. https://doi.org/10.1007/978-3-642-17819-1_8.
Simmhan, Y.L., Plale, B., Gannon, D., 2005. A survey of data provenance in e-science. SIGMOD Rec. 34 (3), 31–36.
Spector, P., 2008. Data Manipulation with R, 1. Springer, Berlin/Heidelberg.
Stodden, V., Leisch, F., Peng, R.D., 2014. Implementing Reproducible Research, 1. Chapman and Hall/CRC, USA.
Travassos, G.H., Barros, M., 2004. Contributions of in virtuo and in silico experiments for the future of empirical studies in software engineering. In: Proceedings of the 2nd Workshop on Empirical Software Engineering: The Future of Empirical Studies in Software Engineering. Fraunhofer IRB Verlag, Roman Castles, Italy.
Tuot, C.J., Sintek, M., Dengel, A.R., 2008. IVIP – a scientific workflow system to support experts in spatial planning of crop production. In: Proceedings of SSDBM'08, pp. 586–591. https://doi.org/10.1007/978-3-540-69497-7_42.
van Evert, F.K., Spaans, E.J.A., Krieger, S.D., Carlis, J.V., Baker, J.M., 2008. A database for agroecological research data: I. Data model. Agron. J. 91 (1), 54–62. https://doi.org/10.2134/agronj1999.00021962009100010009x.
Wallis, J.C., Rolando, E., Borgman, C.L., 2013. If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology. PLoS One 8 (7), e67332. https://doi.org/10.1371/journal.pone.0067332.
Wassink, I., Rauwerda, H., Neerincx, P.B.T., van der Vet, P.E., Breit, T.M., Leunissen, J.A.M., Nijholt, A., 2009. Using R in Taverna: RShell v1.2. BMC Res. Notes 2, 138. https://doi.org/10.1186/1756-0500-2-138.
White, J.K., van Evert, F.K., 2008. Publishing agronomic data. Agron. J. 100 (5), 1396–1400. https://doi.org/10.2134/agronj2008.0080F.
Wilkinson, M.D., et al., 2016. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018. https://doi.org/10.1038/sdata.2016.18.
Yeumo, E.D., Alaux, M., Arnaud, E., et al., 2017. Developing data interoperability using standards: a wheat community use case. F1000Research 6, 1843. https://doi.org/10.12688/f1000research.12234.1.
Yu, J., Buyya, R., 2005. A taxonomy of workflow management systems for grid computing. J. Grid Comput. 3 (3), 171–200. https://doi.org/10.1007/s10723-005-9010-8.
Zhou, Y., Talburt, J.R., 2015. Entity Information Life Cycle for Big Data, 1. Morgan Kaufmann, USA.