A Framework for Provenance Analysis and Visualization

A Framework for Provenance Analysis and Visualization

Available online at www.sciencedirect.com ScienceDirect ProcediaisComputer Science 108CProcedia (2017) 1592–1601 This space reserved for the header, ...

1MB Sizes 27 Downloads 140 Views

Available online at www.sciencedirect.com

ScienceDirect ProcediaisComputer Science 108CProcedia (2017) 1592–1601 This space reserved for the header, do not use it

This space is reserved for the Procedia header, do not use it This space is reserved for the Procedia header, do not use it

International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland

A Framework for Provenance Analysis and Visualization A Framework for Provenance1 Analysis and Visualization 1 Weiner Oliveira1 , Lenitta M. Ambr´osio , Analysis Regina Braga , Victor Str¨oele1 , Jos´e A Framework for Provenance and Visualization 1 1 1 1 , and Campos Maria M. David Weiner Oliveira1 , Lenitta Ambr´ osioFernanda , Regina Braga , Victor Str¨oele1 , Jos´e 1 1 1 1 osio , Regina Braga , 1Victor Str¨ Weiner Oliveira , Lenitta Ambr´ oele1 , Jos´e , and Fernanda Campos Maria M. David Federal University of Juiz de Fora (UFJF), Juiz de Fora, Minas Gerais, Brail 1 , and Fernanda Campos1 Maria David {weiner, lenita.martins}@ice.ufjf.br

Federal University of Juiz de Fora (UFJF), Juiz de Fora, Minas Gerais, Brail {regina.braga, victor.stroeleg, jose.david, fernanda.campos}@ufjf.edu.br Federal University{weiner, of Juiz delenita.martins}@ice.ufjf.br Fora (UFJF), Juiz de Fora, Minas Gerais, Brail {regina.braga, victor.stroeleg, jose.david, fernanda.campos}@ufjf.edu.br {weiner, lenita.martins}@ice.ufjf.br {regina.braga, victor.stroeleg, jose.david, fernanda.campos}@ufjf.edu.br

Abstract Data provenance is a fundamental concept in scientific experimentation. However, for their Abstract proper understanding use, efficient and user-friendly mechanisms are needed. Research in Data provenance is a and fundamental concept in scientific experimentation. However, for their Abstract software visualization, ontologies and complex networks can help in this process. This paper properprovenance understanding use, efficient and user-friendly mechanisms are needed. Research in Data is a and fundamental concept in scientific experimentation. However, for their presents avisualization, framework to assist the and use ofhelp datain provenance using visualizasoftwareunderstanding ontologies andunderstanding complex networks can this This paper proper and use,inefficient and user-friendly mechanisms are process. needed. Research in tion techniques, ontologies andincomplex networks. The the provenance data presents avisualization, framework toontologies assist the andframework use datacapture provenance usingThis visualizasoftware andunderstanding complex networks canofhelp in this process. paper and new information using ontologies provenance graph analysis. Thevisualizagraph is tion generates techniques, ontologies andincomplex networks.and The framework capture the provenance data presents a framework to assist the understanding and use of data provenance using analyzed through complex networks techniques and provide some metrics to help in each node and generates information using ontologies provenance analysis. The graph is tion techniques,new ontologies and complex networks.and The frameworkgraph capture the provenance data analyzes. The new visualization presents and highlights inferences and results. The analyzed through complex networks and provide somegraph metrics to help in framework each node and generates information usingtechniques ontologies andthe provenance analysis. The graph is was used through in thevisualization E-SECO ecosystem to and support the some scientific analyzes. The presents and highlights the inferences and experimentation. results. The analyzed complexscientific networks techniques provide metrics to help in framework each node was used thevisualization E-SECO scientific ecosystem to support the scientific analyzes. The and highlights theNetwork inferences and experimentation. results. The framework Keywords: E-science, Provenance, Visualization, Complex © 2017 Thein Authors. Published bypresents Elsevier B.V. was used in the E-SECO scientific ecosystem to support the scientific experimentation. Peer-review under responsibility of the scientific committee of the International Conference on Computational Science Keywords: E-science, Provenance, Visualization, Complex Network Keywords: E-science, Provenance, Visualization, Complex Network

1 Introduction 1 Introduction Data provenance is seen as a fundamental component for scientific workflow analysis [11] since it 1 Introduction provides information on data origin and processes by which data is passed. It offers reliability

Data provenance is seen as a fundamental component for scientific workflow analysis [11] since it for data, which is important in scientific experimentation [4]. addition, the use of provides information on as data origin and processes byfor which data In is passed. It with offers[11] reliability Data provenance is seen a fundamental component scientific workflow analysis since it provenance models, such as OPM [22] or PROV [12], the interoperability of these data is for data, information which is important in scientific experimentation addition, use of provides on data origin and processes by which [4]. data In is passed. It with offersthe reliability enabled, aiding in the reuse process, the cooperation between research groups and consequent provenance models, such as OPM [22] or experimentation PROV [12], the[4]. interoperability these is for data, which is important in scientific In addition, of with thedata use of improving scientific research. enabled, aiding in the reuse the or cooperation between research groupsofand consequent provenance models, such asprocess, OPM [22] PROV [12], the interoperability these data is Visualization part of software [8]between can assist in understanding proveimproving scientific research. enabled, aiding inasthe reuse process, understanding the cooperation research groups and data consequent nance as a way of communicating the main aspects of complex data intuitively [1]. However, we Visualization as part of software understanding [8] can assist in understanding data proveimproving scientific research. believe that the only use of visualization mechanisms do not guarantee adequate understanding nance as a way of as communicating theunderstanding main aspects of[8]complex data [1]. However, we Visualization part of software can assist inintuitively understanding data proveof dataasthat provenance inuse a scientific context. Inference with believe theofonly of visualization mechanisms do not guarantee adequate understanding nance a way communicating the main aspects ofmechanisms, complex datatogether intuitively [1].ontologies However,and we complex networks analysis also help. mechanisms Thereby, believe that together the information understandof data that provenance inuse a scientific context. Inferencewemechanisms, with ontologies and believe the only of can visualization do not guarantee adequate understanding ing and provenance reliability of results isInference improved, as wellthat as together theirinformation reuse and cooperation complex networks analysis can also help. Thereby, wemechanisms, believe the understandof data in experiments a scientific context. with ontologies and possibilities between different research groups, using these techniques together. In this way, ing and reliability of experiments is improved, as wellthat as the theirinformation reuse and understandcooperation complex networks analysis can alsoresults help. Thereby, we believe data canreliability be between understood more quickly and safely. possibilities research groups, using these techniques this way, ing and of different experiments results is improved, as well as their together. reuse and In cooperation data can be between understood more quickly safely.using these techniques together. In this way, possibilities different researchand groups, 1 data can be understood more quickly and safely.

1877-0509 © 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the scientific committee of the International Conference on Computational Science 10.1016/j.procs.2017.05.216

1 1



Oliveira et al. / Procedia Computer Science 108C (2017) 1592–1601 A Framework for Provenance Weiner Analysis and VisualizationWeiner, Ambr´ osio, Braga, Str¨ oele, David and Campos

This work aims to detail a framework for analysis and visualization of provenance data, in order to i) simplify the understanding of data provenance; ii) improving the reliability of related experiments; iii) get a better reuse of data provenance and iv) improve the cooperation among research groups. In a recent work, a tool called Prov Viewer [18] was presented. This tool presents the graph with different layouts to aid the user comprehension. In addition, it seeks to highlight the most important nodes with techniques for reducing the graph. The advantage of the Visionary framework is to allow the user to create relationships between the nodes of the graph. Techniques are also used in our work to highlight the most important nodes, but these do not change the initial structure of the graph and are used in conjunction with visualization techniques. In another work a provenance visualization approach called AVOCADO [25] generates a visualization based on the data flow. A heuristic was developed by the authors to calculate the node interest, based on its attributes. The level of interest as well as the hierarchical structure is used to generate clusters, and assist in visualization. Visionary’s focus is to perform a prior analysis of the data before presenting it visually to the user. This analysis offers more resources for the exploration of the graph and generates new information and inferences that were not previously presented. In addition, the Visionary analysis technique is domain-free, unlike AVOCADO technique. This paper is organized as follows. Section 2 presents techniques and concepts used in the research as a background information. Section 3 details the provenance graph analysis. Section 4 presents the framework.

2

Background and Related Works

Digital provenance is the documentation of the lifecycle processes of digital objects. Usually the digital provenance describes the responsibility of agents over the administration of digital objects. Moreover, important events occurring throughout the life cycle of a digital object and other information associated with the creation, management and preservation of digital objects is presented [13]. Buneman et al. [5] were one of the first authors to use the concept of digital provenance in scientific experimentation. Lately, Lim et al. [20] discussed the types of provenance, named retrospective or prospective. Retrospective provenance models past execution of computational tasks. It focuses on data transformation during the tasks execution. Otherwise, prospective provenance models the computational tasks, like a recipe for future data derivation. Currently the research in this area is intense [3, 6, 14, 17, 18, 25] and the use of standardized models like OPM [22] and PROV [12] is an important requirement. These models have emerged from scientific events related to the data provenance, like the Provenance Challenges [21], as is the case with OPM and from standardization organizations on the Web, i. e., W3C, as is the case of PROV. OPM (Open Provenance Model) was developed with the purpose of creating an interoperable model between systems. PROV model has the objective of enabling interoperability and the work in heterogeneous environments. PROV defines provenance as being the information about entities, activities and people involved in the production of a data or thing, which can be used to evaluate its quality and reliability [12]. PROV defines the use and production of entities by activities that are influenced by agents. An entity is a digital, physical or conceptual object with fixed aspects. Activity is something that occurs over a period of time and acts upon entities. An agent is something that has some kind of responsibility about an activity, about the existence of an entity or another agent [12]. Seven basic relations are modeled in the PROV. These relationships are directed links and have 2

1593

1594

Oliveira et al. / Procedia Computer Science 108C (2017) 1592–1601 A Framework for Provenance Weiner Analysis and VisualizationWeiner, Ambr´ osio, Braga, Str¨ oele, David and Campos

a fixed element type at each end. All relations have intuitive names, such as: WasDerivedFrom, Used, WasInformedBy, WasGeneratedBy, WasAttributedTo, WasAssociatedWith and ActedOnBehalfOf [12]. PROV model has an ontology called PROV-O. This ontology defines a coding of the PROV data model for the OWL2 Web Ontology Language [19]. From PROV ontology, data is modeled as a directed graph. This graph is used in this work for different analyzes and visualization generation, from the use of concepts related to complex networks and ontologies. Classification is one of the most intrinsic activities of human beings used to facilitate the handling and organization of the huge amount of information that we receive every day. This classification is performed in order to cluster similar elements, with respect to common attributes. Due to the importance of the classification, it is fundamental to develop methods capable of performing this task automatically [7]. Complex networks can be used in this context. It is a new area of science inspired by empirical data. Researchers, which are interested in this area, develop models for complex systems whose structure can evolve. Dynamical systems, such as epidemic spreading and synchronization, are also taken into account by this theory. In data clustering, network and graph have the same meaning [7]. Considering data provenance as a graph that can be analyzed, we can use complex networks concepts in order to classify and cluster provenance information, enabling a better understanding. However, considering the use of provenance models and its associated ontologies, we can also provide rules and inference mechanism that can bring new information beyond those explicitly provided by stored provenance data, improving the data comprehension. Therefore, in this work we use ontologies and inference rules, together with complex networks analyses, to generate new knowledge about provenance data. The use of an OWL provenance data codification enables the use of complex networks analyses together with ontologies inference rules. Complex network analysis techniques are applied to assist the analysis of each node, giving detailed information to improve the visualization interface. Ontologies bring new provenance data and relationships. User interaction, with this improved provenance data, is done through interactive interface that assists the data visualization. It also uses resources to highlight important information generated by the different analyzes. A systematic literature review was conducted to identify works that deal with provenance data visualization. The review selected 13 papers and the most significant are described in this section. The InProv tool [3] presents the provenance graph with a radial layout, called ring. This work focuses on the exploration of provenance data offering many features for user navigation. InProv does not generate new knowledge about the data, unlike the Visionary framework that analyzes the graph and displays the results with the visualization aid. The tool created by Chen et al. [6] generates a visualization for user exploration. The focus of this proposal is to work with large provenance graphs and process data in real time. It uses data mining techniques and statistics to reduce the graph presented. In the Visionary framework, the provenance graph is not changed and some simple techniques are used to reduce the displayed data. In our proposal, analysis is presented to the user, which has the final choice using the result. Visionary seeks to generate inferences on the data, unlike the work presented. The PROV-O-VIZ [14] is a viewer compatible with the PROV model. This viewer uses Sankey diagram layout, focusing on activities within the data flow. This proposal also uses the PROVO to analyze the provenance data. Visionary works with the three node types and generates analysis on all of them, unlike the PROV-O-VIZ that focuses on the activities. The proposal of Karsai et al. [17] is a viewer, also compatible with the PROV model. But this proposal focuses on the grouping of the graph and the automatic generation of names for the groups. This proposal does not attempt to generate new information or inferences upon the data as the 3



A Framework for Provenance Weiner Analysis and VisualizationWeiner, Ambr´ osio,108C Braga, Str¨ oele, David and Campos Oliveira et al. / Procedia Computer Science (2017) 1592–1601

Visionary does.

3

Provenance Graph Analysis

It is possible to extract provenance graph characteristics, generated by the PROV model, to describe its structure. This description can be used to identify similarities between nodes and the influence range of a node. These characteristics are presented by Ebden et al. [9], aimed to analyze the evolution of provenance graphs. Some of these characteristics are common metrics used for network analysis, such as diameter or number of nodes. Additional metrics are adaptations to provenance graphs and their particularities. Four different metrics were selected by Huynh et al. [15] when analyzing provenance subgraphs. The metrics were used successfully by the authors inferring the quality of each node analyzed. The metrics are (i) number of nodes, (ii) number of edges, (iii) diameter and (iv) maximum finite distance (MFD) [9]. Considering the provenance graph as a directed graph G = (V, E), where V and E are nodes and edge sets respectively, the number of nodes is represented by |V |, and the number of edges by |E|. The diameter of the graph is the largest distance found within the graph, where distance is the smallest path between two nodes. After defining the smallest path among all node pairs, the largest of these values is considered the diameter of the graph. Since provenance graphs are directed graphs, even in related graphs it is possible to find infinite distances. Therefore, in determining the distances, the graph is considered temporarily as non-directed. The MFD is a specific metric for provenance graphs and is defined as the largest finite distance from one node type to another within a directed graph G. Considering the three node types present in PROV, we have nine MFDs to be calculated, reaching a total of 12 metrics. All metrics are used to characterize parts of the provenance graph. The graph is divided into several subgraphs related to each node, and through its characteristics, it is possible to relate the subgraphs generated and consequently the related nodes. This is a context-free form that allows a graph and nodes analysis without the need for adjustments to suit the specificities of each scientific research. Each edge of a PROV graph represents forms of influence between the source node and the destination node [23]. In other words, if there is a path between v0 and vi , noted as v0 → ∗vi , vi was potentially influenced by v0 . In this way we can construct a sub graph of DG,a dependency, in which contains only vertices that influence a certain node, as defined in [15]: DG,a = (VG,a , EG,a )

(1)

VG,a = {v ∈ V : v → ∗a}

(2)

EG,a = {e ∈ E : (∃ vs , vt ∈ VG,a )(e = (vs , vt ))}

(3)

The scientist identifying an element or process that has a prominence, if positive or negative, the analysis of the DG,a graph from the node a, allows the scientist to create links between similar nodes. As an example, to understand the importance of such analysis, we can say that by identifying an element or process in the experiment subject to an error, this analysis allows to find nodes that represent the same types of elements or processes that may have the same error due to the similarity of the dependency graph. The same can happen with elements of high quality, finding similar elements, the scientist can increase the importance of these elements in the experiment consequently increasing the importance of the respective nodes in the provenance graph. 4

1595

1596

A Framework for Provenance Weiner Analysis and VisualizationWeiner, Ambr´ osio,108C Braga, Str¨ oele, David and Campos Oliveira et al. / Procedia Computer Science (2017) 1592–1601

With the same principle of the DG,a sub graph, we can determine an IG,a sub graph. IG,a contains only the nodes that were potentially influenced by a given node a. Thus, we can determine the importance of the node a within the graph G. We define IG,a as follows: IG,a = (VG,a , EG,a )

(4)

VG,a = {v ∈ V : a → ∗v}

(5)

EG,a = {e ∈ E : (∃ vs , vt ∈ VG,a )(e = (vs , vt ))}

(6)

To assist scientists in understanding the experiment, the subgraph analysis is highlighted in the visualization. As the IG,a graph from node a, shows its network of influence, the scientist can identify in the network the impact of excluding or modifying the element or process represented by that node. An option in the view allows you to highlight the nodes (changing their size) according to the influence that it has in the network. This enables the scientist to quickly identify the most influential nodes in the network and make decisions, if necessary. The application of these analyzes varies according to the experiment. Thus, while in an experiment the scientist may seek to increase the influence of a node on the graph for better results, in another experiment the scientist may seek to decrease the influence of the nodes on the graph, leaving the graph with fewer connections to also obtain a better result. In this way, the framework and its resources can improve the experiment comprehension, which enables a better reuse, cooperation among partners and reliability. However, it does not replace the role of the scientist who should conduct the analysis and make the final decision.

4

Visionary Framework

Visionary is a framework created to aid in the diagnosis and understanding of provenance data. An initial prototype of the framework was developed in Java, being fully compatible with the PROV model and is available at https://github.com/pgcc/plscience-ecos.git. E-SECO (e-Science Software ECOsystem) platform [10] is a web-based software ecosystem designed to support researchers’ activities during the overall scientific workflow life cycle [2]. The key modules of this platform have already been developed and evaluated in e-Science domain, and are illustrated in Figure 1. E-SECO Development Environment” is a web component where E-SECO code is available, as open source. As a result, the developer community can contribute through software maintenance and evolution. E-SECO platform relies on a Peer-toPeer network (P2P) where different E-SECO nodes can communicate. The ecosystem is made up of artifacts provided by different nodes situated in different institutions, APIs that help the scientific workflow development in its different steps and the open source development environment. Moreover, E-SECO platform is a collaborative environment to support the development and execution of scientific workflows. It also supports the analysis of data provenance during the modeling and execution of workflows. For this purpose, E-SECO implements a Provenance Data Module”. In this module, the data related to tasks, inputs values, output and information from experiments is stored and can be analyzed using ontologies and complex networks algorithms. For space restrictions, E-SECO platform was not discussed in depth. A detailed presentation of this platform was done by Freitas et al. [10] and Sirqueira et al. [24]. The Visionary framework is a web service and connects to E-SECO platform using its services. 5



A Framework for Provenance Weiner Analysis and VisualizationWeiner, Ambr´ osio,108C Braga, Str¨ oele, David and Campos Oliveira et al. / Procedia Computer Science (2017) 1592–1601

Figure 1: E-SECO platform overview

4.1

Framework Steps

In order to analyze and present provenance data, the Visionary framework has five steps: (1) Data Capture, (2) Ontology Store and Inference, (3) Data Transformation, (4) Data Analysis and (5) Visualization. Figure 2 illustrates these steps. 1. Data Capture: The capture phase of the provenance data is done in two ways. First, in the experiment planning, some basic information is registered by the researcher through the E-SECO interface. This information describes the experiment, researchers and research groups, workflows used in the experiment and their modeling, and Scientific Workflow Management Systems (SWfMSs) associated with each workflow and workflows versions. During the workflow execution phase, data are collected by a web service that communicates directly with the SWMSs. Through this service, information about the experiment executions is obtained, such as: start and end time of the execution, input and output data of each task executed by the workflows and the obtained results. The captured data storage is carried out using distributed repositories, modeled according 6

1597

1598

A Framework for Provenance Weiner Analysis and VisualizationWeiner, Ambr´ osio,108C Braga, Str¨ oele, David and Campos Oliveira et al. / Procedia Computer Science (2017) 1592–1601

Figure 2: Visionary framework steps. The data capture from data base (1), ontology process and inferences (2), the data transformation for the visualization (3), data analysis through complex networks (4) and finally the visualization step for the user exploration. to PROV data model, facilitating its interpretation, and the interoperability with other systems. At this stage, information about experiment provenance, experiment execution and the data consumed and produced during the execution is stored. Although it is a large volume of data, the E-SECO platform can manage this data with the support of a peer-to-peer network. Each network node has an E-SECO data repository, allowing data to be stored in a decentralized but uniform way. 2. Ontology Store and Inference: In this step the provenance data is transposed into an OWL document modeled from the PROV-ONE ontology (Cuervas-Vicenttin et al., 2014). The main objective of the ProvONE ontology is to capture data from scientific experiments related to a specific workflow and its derivations and sub workflows. In the ProvONE ontological model, there is no explicit concern with the distributed nature of the current scientific experimentation process. Through this model a scientific experiment cannot be processed in a distributed way, using several workflows, executed in different SWfMS, besides needing to support the collaboration of the users. To support these new features an extension of the ProvONE ontology, called Prov-SE-O, was developed which includes new classes, properties and rules in SWRL (Semantic Web Rule Language). Thus, it was possible to model not only the workflows, but also scientific experiment, as well as to capture important information related to its distributed nature and the need to support collaboration activities between the different agents. In this phase, the captured data and those already stored in the database are inserted into the ontology. Through inference algorithms, implicit knowledge can be derived, such as, experiments and workflows similar 7



A Framework for Provenance Analysis VisualizationWeiner, Ambr´ osio, Braga, Str¨ oele, David and Campos Weiner and Oliveira et al. / Procedia Computer Science 108C (2017) 1592–1601

to each other, or derived from each other; SWMS used for the workflows execution (when they are not explicitly informed); researchers, institutions and research groups involved or influencing the experiment; workflows or external services used in the experimentation process; data consumed and generated by the experiment, among others. 3. Data Transformation: The information contained in the ontology is read and modeled in a different format used by the next steps of Data Analysis and Visualization. This task just connects two different technologies (OWL and JSON). 4. Data Analysis: Data in a new format is analyzed, the graphs DG,a and IG,a are defined for each node and the metrics are calculated for each node. This step is an important part of the framework. The analysis of the subgraph of each node occurs in two ways, identifying the importance of the node to the graph and the similarity between nodes. Considering G = (V, E) the graph of provenance and IG,a = (V , E) the graph of influence generated from node a, where IG,a ⊂ G. The importance of the node is determined by the subgraph IG,a , the number of nodes |V |, and the number of edges |E|. The greater |V |, the greater the set of nodes influenced. In the same way, the greater |E|, the more times the node influences the nodes belonging to |V |, with more edges, more paths can be draw between two nodes. A metric (7) was created to determine the importance of each node. The weight of the number of vertices and edges of each subgraph IG,a is weighted by the total vertices and arrays of the graph. All these metrics determine the importance of the node in the provenance graph. The importance is a key aspect for the scientist to focus their attention. A change in a node that influence great part of the graph must be well planned in a stable version of a workflow, for example. The similarity between nodes is determined by the metrics, number of nodes |V |, number of edges |E|, diameter and MFD values, all taken from the dependence subgraph DG,a . Metrics are calculated and added to each node in the Data Analysis step. Whenever the scientist wants to find similar nodes for a specific node, the visualization shows in percentages, the similarity of the metrics with other nodes. Considering that provenance data can be used to achieve quality and reliability of entities, activities and agents involved [12], previous information from nodes with high quality and confidence, for example, allows the scientist to find other nodes with potentially the same characteristics. The same similarity can be found in cases of nodes with low quality or low reliability. With prior information from untrustworthy nodes, the scientist can search for similar nodes and look more closely at nodes with more similarity, rather than looking at all entities, activities, and agents for error. |V  |.|E  | |V |.|E|

(7)

5. Visualization: At this stage data is arranged visually. Visualization allows user interaction, data detailing and the display of new information, processed by the ontology and complex networks algorithms whenever requested. It also allows zooming, graph browsing, filtering nodes by type and highlights manipulated objects. Two different symbol can be used and can be switched by the user. The visualization displays those symbols presented by PROV [23], and the symbols commonly used in BPMN [16] with an adaptation. Ontology inferences are highlighted in the visualization. All edges with inferred links are shown in red; with affirmed links are green; with both links are yellow. The click in an edge presents all the connections that it has and the connected nodes. Nodes also present more information with the click as type name and their links to the other nodes. Symbols are used to facilitate the node type identification. Figure 3 presents an example. 8

1599

Analysis and VisualizationWeiner, Ambr´ osio,108C Braga, Str¨ oele, David and Campos 1600 A Framework for Provenance Weiner Oliveira et al. / Procedia Computer Science (2017) 1592–1601

Figure 3: Visionary visualization mode.

5

Conclusion

This paper presented a framework to assist in analyzing and understanding data provenance, called Visionary Framework. The framework uses visualization techniques, together with complex network analyzes and ontology inference to highlight new information found and help the exploration of provenance data. The data are analyzed through complex networks techniques and provide some metrics to help in node analyzes. Ontologies rules are used to infer new information and improves provenance comprehension. The framework was used in the E-SECO scientific ecosystem as a proof of concepts. However, a formal evaluation need to be conducted in order to validate the proposal. The Visionary Framework is a first step in order to provide more provenance data comprehension and to improve experiment reuse, reliability and cooperation among research groups. An initial prototype of the proposal was implemented and used. However, new improvements must be done, considering Visionary framework use.

References [1] Bilal Arshad, Kamran Munir, Richard McClatchey, and Saad Liaquat. Position paper: Provenance data visualisation for neuroimaging analysis. arXiv preprint arXiv:1502.01556, 2015. [2] Adam Belloum, Marcia A Inda, Dmitry Vasunin, Vladimir Korkhov, Zhiming Zhao, Han Rauwerda, Timo M Breit, Marian Bubak, and Luis O Hertzberger. Collaborative e-science experiments and scientific workflows. IEEE Internet Computing, 15(4):39–47, 2011. [3] Michelle A Borkin, Chelsea S Yeh, Madelaine Boyd, Peter Macko, Krzysztof Z Gajos, Margo Seltzer, and Hanspeter Pfister. Evaluation of filesystem provenance visualization tools. IEEE Transactions on Visualization and Computer Graphics, 19(12):2476–2485, 2013. [4] Madelaine D Boyd. Inprov: visualizing provenance graphs with radial layouts and time-based hierarchical grouping. Harvard College Cambridge, Massachusetts, 2012. [5] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Data provenance: Some basic issues. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 87–93. Springer, 2000.

9



Weiner Oliveira et al. / Procedia Computer Science 108C (2017) 1592–1601 A Framework for Provenance Analysis and VisualizationWeiner, Ambr´ osio, Braga, Str¨ oele, David and Campos

[6] Yingjie Victor Chen, Zhenyu Cheryl Qian, Robert Woodbury, John Dill, and Chris D Shaw. Employing a parametric model for analytic provenance. ACM Transactions on Interactive Intelligent Systems (TiiS), 4(1):6, 2014. [7] Guilherme F de Arruda, Luciano da Fontoura Costa, and Francisco A Rodrigues. A complex networks approach for data clustering. Physica A: Statistical Mechanics and its Applications, 391(23):6174–6183, 2012. [8] Stephan Diehl. Software visualization: visualizing the structure, behaviour, and evolution of software. Springer Science & Business Media, 2007. [9] Mark Ebden, Trung Dong Huynh, Luc Moreau, Sarvapali Ramchurn, and Stephen Roberts. Network analysis on provenance graphs from a crowdsourcing application. In International Provenance and Annotation Workshop, pages 168–182. Springer, 2012. [10] Vitor Freitas, Jos´e Maria N David, Regina Braga, and Fernanda Campos. Uma arquitetura para ecossistema de software cient´ıfico. WDES 2015, page 41, 2015. [11] Yolanda Gil, Ewa Deelman, Mark Ellisman, Thomas Fahringer, Geoffrey Fox, Dennis Gannon, Carole Goble, Miron Livny, Luc Moreau, and Jim Myers. Examining the challenges of scientific workflows. Computer, 40(12), 2007. [12] Paul Groth and Luc Moreau. Prov-overview. an overview of the prov family of documents. 2013. [13] PREMIS Working Group et al. Data dictionary for preservation metadata: final report of the premis working group. OCLC Online Computer Library Center & Research Libraries Group, Dublin, Ohio, USA, Final report, 2005. [14] Rinke Hoekstra and Paul Groth. Prov-o-viz-understanding the role of activities in provenance. In International Provenance and Annotation Workshop, pages 215–220. Springer, 2014. [15] Trung Dong Huynh, Mark Ebden, Matteo Venanzi, Sarvapali D Ramchurn, Stephen Roberts, and Luc Moreau. Interpretation of crowdsourced activities using provenance network analysis. In First AAAI Conference on Human Computation and Crowdsourcing, 2013. [16] Object Management Group/Business Process Management Initiative et al. Business process model and notation (bpmn) version 2.0. Available on: http://www. omg. org/spec/BPMN/2.0, 2011. [17] Linus Karsai. Clustering Provenance. PhD thesis, University of Sydney, 2016. [18] Troy Kohwalter, Thiago Oliveira, Juliana Freire, Esteban Clua, and Leonardo Murta. Prov viewer: a graph-based visualization tool for interactive exploration of provenance data. In International Provenance and Annotation Workshop, pages 71–82. Springer, 2016. [19] Timothy Lebo, Satya Sahoo, Deborah McGuinness, Khalid Belhajjame, James Cheney, David Corsar, Daniel Garijo, Stian Soiland-Reyes, Stephan Zednik, and Jun Zhao. Prov-o: The prov ontology. W3C recommendation, 30, 2013. [20] Chunhyeok Lim, Shiyong Lu, Artem Chebotko, and Farshad Fotouhi. Prospective and retrospective provenance collection in scientific workflow environments. In Services Computing (SCC), 2010 IEEE International Conference on, pages 449–456. IEEE, 2010. [21] L Moreau and B Lud¨ ascher. Concurrency and computation: Practice and experience–special issue on the first provenance challenge, 2007. [22] Luc Moreau, Ben Clifford, Juliana Freire, Joe Futrelle, Yolanda Gil, Paul Groth, Natalia Kwasnikowska, Simon Miles, Paolo Missier, Jim Myers, et al. The open provenance model core specification (v1. 1). Future generation computer systems, 27(6):743–756, 2011. [23] Luc Moreau and Paolo Missier. Prov-dm: The prov data model. 2013. [24] T´ assio FM Sirqueira, Humberto LO Dalpra, Regina Braga, Marco Antˆ onio P Ara´ ujo, Jos´e Maria N David, and Fernanda Campos. E-seco proversion: An approach for scientific workflows maintenance and evolution. Procedia Computer Science, 100:547–556, 2016. [25] H. Stitz, S. Luger, M. Streit, and N. Gehlenborg. Avocado: Visualization of workflowderived data provenance for reproducible biomedical research. Computer Graphics Forum, 35(3):481–490, 2016.

10

1601