Toward the modeling of data provenance in scientific publications

Tariq Mahmood, Syed Imran Jami ⁎, Zubair Ahmed Shaikh, Muhammad Hussain Mughal
Center for Research in Ubiquitous Computing, National University of Computer & Emerging Sciences (NUCES), Karachi, Pakistan

Computer Standards & Interfaces 35 (2013) 6–29; doi:10.1016/j.csi.2012.02.004

⁎ Corresponding author. E-mail addresses: [email protected] (T. Mahmood), [email protected] (S.I. Jami), [email protected] (Z.A. Shaikh), [email protected] (M.H. Mughal).

Article history: Received 30 April 2011; Received in revised form 10 November 2011; Accepted 14 February 2012; Available online 16 March 2012

Keywords: Data provenance; Scientific publication; Open Provenance Model (OPM); Knowledge management; Semantic interoperability

Abstract

In this paper, we implement a provenance-aware system for documenting publications, called PADS. It employs a three-layered provenance hierarchy, which can output diverse types of provenance data related to the research life cycle. From this data, we generate different profiles for research ventures, reviewers, and authors. PADS employs the standard Open Provenance Model (OPM) specification for capturing provenance data, and stores this data as ontological instances. We show that the data is retrieved without any apparent delay in the execution time of the queries. We also demonstrate how this data can be used to make useful recommendations to the organizers, in order to manage upcoming research ventures.

1. Introduction

The concept of "provenance" typically refers to the documented origin and history of a digital data object, or artifact [1,2]. It depicts all phases of modification carried out on this artifact, from its creation to its current digital state. It also keeps track of the data sources that contribute to the modification processes, e.g., data that occurs at instantaneous points in some data stream [3]. There are two primary types of provenance: 1) data provenance, i.e., provenance related to a given artifact, and 2) process provenance, i.e., provenance related to a process involving one or more artifacts [2,3]. For instance, in the medical domain, data provenance for a given diagnosis can answer queries such as "Who was the doctor who carried out the diagnosis?", "What was the initial prescription in this diagnosis?", "What was the condition of the patient after the second prescription?" etc. On the other hand, process provenance can answer more complicated and crucial queries such as "Why was a diagnosis not successful?", "Why was the patient shifted to another hospital?", "Why was the patient recommended a transplant?" etc. In fact, the response to a process-based query involves accessing the data provenance of the diverse artifacts that are involved in this process. For instance, the artifacts involved in the medical domain could be a doctor, patient, prescription, disease, hospital staff etc. The most comprehensive application of provenance has been in the scientific domain [2], e.g., in bioinformatics [4], chemistry [5], biology [6,7], materials engineering [7], physics and astronomy [8] etc. In all these applications,


the most notable one has been in the domain of scientific experimentation, mostly in order to validate and reproduce the experimental results, as well as to determine the accuracy and robustness of the process involved in generating these results [9–11]. Amongst other factors, this provenance output is linked to the documentation of experiments in some structured format, which typically occurs within a scientific publication, or research paper. In order to describe this concept, it is necessary to first summarize the process of scientific publication.

The process of scientific publication (henceforth labeled as "publication") starts with the announcement of the Call For Papers (CFP), related to a research venture (e.g., conference, workshop etc.) by some scientific authority (e.g., university, research institution etc.). The CFP provides relevant information for the publication's author(s), e.g., a description of the required research input, the format for writing the publication, the panel responsible for reviewing the publication etc. The release of the CFP initiates the authors' efforts in starting the research work, which may be either collaborative or individual. Initially, the author(s) identify the limitations (requirements) of a particular (research-based) domain, which have not been addressed previously by the researchers of this domain. Then, they propose novel approaches (methods, technologies, frameworks etc.) that attempt to cater for these limitations, and conduct experiments to validate these approaches. Finally, they document the experimental results in a publication (research paper). If this publication is accepted by the CFP reviewing panel, it is published in a standard format, as selected by the CFP authority. This allows researchers to both standardize their work, and to present and share their ideas with their related research community [11]. For instance, E-commerce researchers may identify the need to create more usable and adaptive interactions with online buyers, in order to maximize the sale of products. Such an adaptive approach will then be proposed, experiments will be conducted in order to validate it, and the whole process will be
expounded in a publication. The reviewers related to a given CFP either accept or reject each submitted publication, based on their experience and knowledge of the publication's domain.

As mentioned above, we believe that the provenance of a scientific experiment is linked to the provenance of the publication that documents this experiment. For instance, consider publication-based provenance queries such as "In the paper, has the novel idea been validated with an experiment?", "Does the experiment address the limitations of the current research work?", "Do the authors select an appropriate methodology for the experiment?" etc. We believe that appropriate responses to these queries are essential before one can start validating the methodology and results, or even reproducing the experiment. For instance, it could be infeasible to waste resources in re-creating an experiment whose methodology has not been appropriately selected, or which does not address any limitation of the current research. Also, we can choose to re-create only those experiments which have been validated, or non-validated. In essence, we need provenance data related to the publication, in order to both guide and support the provenance usage of scientific experimentation. Moreover, the provenance of the publication, by itself, can provide responses to a diverse number of useful queries such as "When was the CFP generated?", "Who were the authors of a given publication?", "How is a particular section of the paper linked to another given section?", "Does a given section contain a particular research content (e.g., experimental methodology)?" etc. Responses to these types of queries can provide information regarding the activities related to the CFP, the actual structure of the publication, and the content present in each section of the publication. In fact, previous work on recording the provenance of a given publication is related only to the process of evolution of the idea presented in the paper [11]. So, we can acquire information about the process through which the idea evolved over a set of publications. However, a publication contains more contents (than just the idea), and it is important to determine the link between these contents, in order to acquire a more robust provenance model for the domain of scientific publication.

In this paper, we aim to cater for the aforementioned requirements by designing and implementing a provenance-aware publication documentation system, which we have labeled as PADS. PADS generates two types of provenance data: 1) a provenance hierarchy for scientific publication, which comprises three layers, i.e., the CFP Layer, the Document Layer, and the Content Layer, and 2) a structured provenance model, or profile, for three entities, i.e., the reviewer, the author and the research venture. Going into a bit more detail, we define the three layers of our provenance hierarchy as follows (see the illustrative sketch after this list):

• CFP layer: this layer is the most abstract one, and provides generic provenance data related to the process of CFP of a given research venture,
• Document layer: this layer is more specific than the CFP layer, and provides provenance data related to the process of documenting a publication, which has been submitted to the given research venture, and
• Content layer: this layer is more specific than the Document layer, and provides provenance data related to different research-based contents present within different sections of the aforementioned documented publication.
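To make the hierarchy concrete, the following minimal sketch (our illustration only; all class and field names are hypothetical, and PADS itself does not expose these classes) shows the kind of record each layer can output:

from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical records for the three layers; each layer refines the one above it.
@dataclass
class CFPLayerRecord:
    venue_name: str                                    # e.g., conference or workshop name
    cfp_announced: str                                 # announcement date of the CFP
    reviewers: List[str] = field(default_factory=list)
    decisions: Dict[str, str] = field(default_factory=dict)   # paper title -> "accept"/"reject"

@dataclass
class DocumentLayerRecord:
    paper_title: str
    sections: List[str] = field(default_factory=list)         # documented sections
    derived_from: Dict[str, List[str]] = field(default_factory=dict)  # section -> source sections

@dataclass
class ContentLayerRecord:
    paper_title: str
    contents: Dict[str, str] = field(default_factory=dict)    # section -> research content found there
    validated: bool = False                            # was the idea validated, e.g., by experiment?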
The CFP layer provides generic provenance knowledge regarding the logistics related to a CFP, e.g., the date at which the CFP was announced, the panel of reviewers selected by the CFP authority, the result of the reviewing process etc. Then, the Document layer provides more specific provenance data related to the structure of the publication, e.g., the number of sections documented in the paper, and the way in which each of these sections is related to the other ones, e.g., the Introduction section is derived from the paper's Abstract section, because it provides a more detailed description of the content of Abstract. Finally, the Content layer provides provenance data related to the research-based

contents associated with a given publication, e.g., the limitations of a given research domain, the novel idea that has been presented in the paper, and whether this idea has been validated, e.g., through experimentation, mathematical analysis etc. As the Content layer is more specific than the Document layer, it also depicts how the contents are associated with each other over different sections, e.g., the idea presented in the Introduction is derived from the Abstract, the idea presented in the body of the paper is derived from the Introduction (as the former is a more detailed version of the latter) etc.

Moreover, the data from our provenance hierarchy is used to derive three types of structured profiles, i.e., the profile of a given research venture, a profile for each author submitting to this venture, and a profile for each reviewer associated with this venture. These profiles can assist the organizers of a research venture in organizing upcoming research ventures, e.g., by recommending sponsors for this venture, selecting the authors to whom the CFP should be emailed (based on the authors' research interests), recommending possible reviewers for this venture etc.

For a given research venture, PADS tracks the aforementioned provenance data through two methods. It asks the organizers, reviewers and authors to input relevant provenance data, e.g., data concerning the CFP, the output of a review process, the documented sections etc. It also automatically tracks some of the data, e.g., the time spent revising a given section, the list of keywords in a given research paper etc. The complete provenance data is represented by using the standard data model for provenance called the Open Provenance Model (OPM, version 1.1) [12], which provides a graphical structure called a "provenance graph" for modeling provenance-based relationships. In these graphs, the relationships between different graphical nodes are depicted using standard OPM notations. Hence, provenance graphs generated for different scientific publications (by using PADS) are semantically inter-operable, facilitating the comparison of provenance across different publications, as well as the integration of our hierarchy with other related provenance models, e.g., for scientific experimentation.

From the implementation perspective, we store our provenance data as instances of six ontological classes, one for each layer of our provenance hierarchy, and one for each of our profiles. We model this ontology through the OWL language1 in the Protégé editor,2 and store it in a provenance data store. So, the application of our provenance system to different research ventures results in different ontological instances being stored in our data store. Finally, users can query these instances by using the standard SPARQL query language.3

To the best of our knowledge, our contributions, i.e., the provenance hierarchy, the profiles, and the set of recommendations for managing research ventures, are novel; they have not been previously implemented in any state-of-the-art provenance system for scientific publication. Moreover, both our hierarchy and profiles are extensible, i.e., researchers can add or modify their provenance data, as needed, for these entities. In order to test and validate our proposed approach, we have documented provenance data related to three conferences (research ventures) within PADS.
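As an illustration of this storage-and-query pipeline, the sketch below uses the rdflib library as a stand-in for the Protégé/provenance data store toolchain; the pads: namespace, class names, and property names are our own hypothetical examples, not the actual PADS ontology:

from rdflib import Graph, Literal, Namespace, RDF

PADS = Namespace("http://example.org/pads#")   # hypothetical ontology namespace
g = Graph()

# Store one ontological instance describing a submitted paper.
paper = PADS["paper_42"]
g.add((paper, RDF.type, PADS.ResearchPaper))
g.add((paper, PADS.title, Literal("Toward the modeling of data provenance")))
g.add((paper, PADS.submittedTo, PADS["conference_X_2012"]))
g.add((paper, PADS.decision, Literal("accept")))

# A CFP-layer SPARQL query: titles and decisions of all submitted papers.
results = g.query("""
    PREFIX pads: <http://example.org/pads#>
    SELECT ?title ?decision WHERE {
        ?p a pads:ResearchPaper ;
           pads:title ?title ;
           pads:decision ?decision .
    }""")
for title, decision in results:
    print(title, decision)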
For one of these conferences, we show snapshots of the provenance data acquired from our proposed hierarchy, and also from the conference profile and the author profile. We also demonstrate the use of SPARQL within Protégé, and comment on the time taken to execute these queries. Through these outputs, the reader can see that our provenance data can be used to make useful recommendations for managing upcoming research ventures. We show a possible list of such recommendations for an upcoming conference.

The rest of the paper is structured as follows. In Section 2, we discuss the state-of-the-art provenance systems that have been applied to the domains of both scientific publication and scientific experimentation. Then, in Section 3, we describe the basic entities and relationships of

1. http://www.w3.org/TR/owl-ref/
2. http://protege.stanford.edu/
3. http://www.w3.org/TR/rdf-sparql-query/

provenance graphs, as specified by OPM, and in Section 4, we describe the complete process of scientific publication. Later on, in Section 5, we describe our proposed provenance system, i.e., the provenance tracking, the provenance hierarchy and the three profiles, along with the system for representing and querying the provenance data. Then, in Section 6, we test and validate PADS, by showing the possible provenance outputs, as well as the execution of SPARQL queries within Protégé. Finally, in Section 7, we conclude our work, present some of its limitations, and describe our future work.

2. Related work

Perhaps the work most related to our approach is the one by Wood et al. [11], in which the authors propose two concepts in order to apply data provenance to the domain of scientific publication. The first concept is that of theory provenance, i.e., the provenance of the idea(s) presented in a publication, e.g., how an idea evolves over a set of given publications. The second concept is that of semantic citation, i.e., the association of pre-defined semantics with a given research paper, e.g., the knowledge that the idea presented in a new paper is quite similar to the one presented in some previous one. The authors show how these two concepts can provide a more concrete understanding of the data and processes related to a scientific publication. In comparison, our approach presents a more detailed provenance framework, for three primary reasons: 1) the Content layer is concerned not only with the provenance of ideas, but with all types of research-based contents (refer to Section 4), 2) our proposed hierarchy not only assists in understanding the evolution of a given content across previous research papers, but also provides provenance knowledge about how this content is linked with the other contents within the paper itself, and 3) we concern ourselves not only with the provenance of the content, but also with the provenance of both the documentation and the CFP as well.

A domain related to our work is that of scientific experimentation. In our opinion, the most prominent work in this category is the one done by Hunter and Cheung [13], in which the authors design and implement a tool called Provenance Explorer. This tool supports a graphical representation of the provenance for the complete process of scientific experimentation, e.g., the raw data sets used in the experiment, the logistic details of the experiment (e.g., where it was conducted, the computational and human resources used, the finances incurred etc.), the analytical details of the experiment (e.g., the criteria used for analysis, how many people were involved and their individual analytical contributions etc.), the hypotheses checked in the experiment etc. The Provenance Explorer also supports the creation of a composite digital object called a "scientific publication package", which links the raw experimental data to the final conclusions through a sequence of pre-defined events. It also employs logic-based inference in order to deduce provenance relationships, e.g., from "the event of storing the results preceded the event of their analyses" and "the event of results' analyses preceded the event of hypotheses-checking", it infers that "the event of storing the results preceded that of hypotheses-checking".
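The inference quoted above is essentially a transitive closure over "preceded" facts; a minimal sketch of this kind of reasoning (ours, not the Provenance Explorer code) is:

# Minimal sketch (ours) of the logic-based inference quoted above:
# "preceded" is transitive, so new precedence facts follow from known ones.
def infer_precedence(preceded):
    """preceded: set of (earlier_event, later_event) pairs; returns its closure."""
    closure = set(preceded)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

facts = {("store results", "analyze results"),
         ("analyze results", "check hypotheses")}
# Infers ("store results", "check hypotheses"), as in the example above.
print(infer_precedence(facts))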
The composite package also allows the authors to protect their intellectual property, i.e., to illustrate all the provenance data required to validate and use their experiments, without compromising their competitive edge over state-of-the-art work. In fact, our work can strengthen this composite package, by providing the provenance for the process of CFP as well as for the paper's structure (along with that of the experimental process). We will then be able to answer queries that link our proposed provenance hierarchy to the experimental process, e.g., "For a given CFP, how many data sets were documented?", "How many experimental validations were documented?", "How many publications, exploiting a particular methodology or data set, were accepted (or rejected) by the reviewers?" etc. In other words, our work can be used to provide a complete (and more thorough) scientific publication package. Another comprehensive work in the domain of scientific experimentation is the one done by Miles et al. [14], in which the authors initially

identify 23 use cases from experiments performed in the domains of biology, chemistry, computer science and physics. Then, they mention how provenance can be used to track useful information about these experiments, e.g., determining whether a protein sequence of a previous experiment was processed through a standard, biological computer service, or determining the high-level plan used for a previous chemical experiment in order to reproduce this experiment etc. Later on, the authors propose a robust architecture that models this provenance data. In another work [15], Miles et al. motivate, devise and implement a provenance framework that validates experimental results through reasoning over their provenance information, and through semantic descriptions of domain-specific computer services. The authors validate their framework by applying it to the domain of bio-informatics.

Additionally, a wide range of domain-specific provenance systems have been implemented that track, represent, and store the provenance of the experiments performed in their respective scientific domains. They also facilitate querying the provenance of the generated data, along with the contributing and affecting entities. For instance, myGrid [4,9] is a biological provenance system that integrates data from distributed resources and provides service layers for interfacing the complexity of data integration. The results of biological experiments are extracted by different regional researchers in order to investigate different research problems collaboratively, and to store the solutions in grid-based frameworks [4]. Along with this, semantic web services are being used in diverse domains, in order to link and describe complex scientific information [5,6]. Also, Trio [10] is a relational database management system that keeps track of the lineage, or the provenance, of data objects, e.g., sales data can be derived from customers' invoice data, net profit (or loss) can be derived from accounting data etc. The authors apply Trio to a running example of the Christmas Bird Count (CBC), i.e., people counting millions of birds on Christmas day at diverse locations across the globe. They demonstrate diverse lineage queries on CBC, e.g., "How much was a particular bird count affected by environmental and geological changes?", "How is the count for a particular region different from the counts of the previous two years?" etc. Trio also supports querying of imprecise information, which is generated due to the uncertainty of human behavior in counting the birds.

Moreover, in [16], Hartig proposes a framework for Web-based applications that provides provenance information for both the creation of web data and its access, e.g., "Who were the creators of the data?", "When was it created?", "How many users have accessed this data?", "How many different web services serve this access request?" etc. The author also discusses different options to both store and query such provenance data. Additionally, in [17], the authors propose and describe the implementation of a Provenance Aware Storage System, or PASS, that maintains the complete history of a given artifact. The paper focuses on the advantages of using a PASS, as compared to the crude method of annotating provenance information in separate databases.
Through a running example, it verifies that PASS is able to automatically generate and maintain system-level provenance (without any user intervention), along with avoiding any provenance loss. Finally, in [18,19], the authors propose KARMA, a framework for the collection and management of provenance data from scientific workflows that are composed of grid and web services. KARMA records uniform and usable provenance data, facilitates the validation of workflows, ensures the quality of generated data products, minimizes the modification burden on the service authors, and reduces the performance overhead on the workflow engine and the services. The authors justify the performance of KARMA by applying it to various services and workflows.

In our opinion, two issues need to be highlighted concerning the aforementioned works. Firstly, some of the works are domain-specific, i.e., they satisfy the provenance needs of only a specific scientific domain. Secondly, none of the papers bridge the wide gap that exists between the provenance of scientific experimentation and that of the other activities concerning a scientific publication, e.g., the CFP. In essence, the
effect of all the research activities (related to a scientific publication) on the experiments has been completely ignored. In contrast, our provenance hierarchy can be applied to any research domain, and provides information that can link abstract CFP provenance to the specific research content provenance, e.g., related to experiments.

3. Open Provenance Model

The Open Provenance Model (OPM) resulted from the efforts of the scientific community to represent the provenance of digital artifacts [12]. The first outcome of these efforts was OPM version 1.00, released in 2007 [20]. Later on, a modified OPM version 1.0.1 was released, and since December 2009, the latest version is OPM version 1.1 [12], which we shall employ in our work. The main objectives of OPM include facilitating the exchange of provenance-related information, facilitating developers in the construction of tools for the storage and management of provenance, and setting the core rules for provenance representation, applicable to any given domain.

OPM version 1.1 allows the modeling of provenance information, i.e., the causes of actions, and the dependencies of resultant artifacts, in the form of a directed graph, which is labeled as a "provenance graph". Generally speaking, the nodes of this graph represent "entities" (artifact, process, agent — defined below), whereas edges represent the causal relationships among these entities. We now define the terminology (along with its representation format) related to this graph: 1) an artifact is an immutable piece of state; it is represented by an ellipse, 2) a process comprises a sequence of actions caused by an artifact, or affecting the transformation of an artifact; it is represented by a rectangle, 3) an agent is a contextual entity that controls the execution of a process; it is represented by an octagon, and 4) the source of an edge in the graph starts from the dependent entity (the one that is being affected), and terminates at the independent entity (the one that is causing the effect); in this context, each edge is labeled as a causal dependency. OPM version 1.1 supports five different types of causal dependencies, i.e., used, wasGeneratedBy, wasTriggeredBy, wasDerivedFrom, and wasControlledBy [21]. We describe them as follows:

• used occurs on an edge from a (dependent) process to an (independent) artifact, and implies that the process uses the artifact to complete its execution,
• wasGeneratedBy occurs on an edge from an artifact to a process, and implies that the process was required to initiate its execution in order for the artifact to be generated,
• wasTriggeredBy occurs on an edge from one process to another, and implies that the start of the execution of the independent process is required in order to start the execution of the dependent process,
• wasDerivedFrom occurs on an edge from one artifact to another, and implies that the contents or functionality of the dependent artifact are derived from the independent one; the latter must have been generated in order for the former to be generated, and
• wasControlledBy occurs on an edge from a process to an agent, and implies that the start and end of the process are controlled entirely by the agent; in the situation where a process is controlled by more than one agent, we need to specify the exact role of each agent in controlling the process.

In OPM version 1.1, temporal information can also be incorporated into a provenance graph, but this will not affect the causal dependencies within the graph. Timestamp information may be associated with instantaneous occurrences of the different artifacts associated with a process, in order to understand, for instance, which causal relationships occur before (or after) the others. Moreover, each process has a starting and an ending time, whereas an artifact has an initiation time, i.e., the time at which it was created, as well as a usage time, i.e., the time for which it was used. In our work, we have employed timestamp information within the provenance hierarchy, but we do not display this information in our provenance graphs. Rather, we indicate which timestamps are recorded through a set of provenance queries associated with our hierarchy (refer to Section 5).
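A minimal sketch of these notions (our own encoding, not an official OPM implementation) represents a provenance graph as typed nodes plus edges labeled with one of the five causal dependencies:

# Our illustrative encoding of an OPM provenance graph: typed nodes and
# edges labeled with one of the five OPM causal dependencies.
NODE_TYPES = {"artifact", "process", "agent"}        # ellipse, rectangle, octagon
DEPENDENCIES = {"used", "wasGeneratedBy", "wasTriggeredBy",
                "wasDerivedFrom", "wasControlledBy"}

class ProvenanceGraph:
    def __init__(self):
        self.nodes = {}   # name -> node type
        self.edges = []   # (dependent, dependency, independent)

    def add_node(self, name, node_type):
        assert node_type in NODE_TYPES
        self.nodes[name] = node_type

    def add_edge(self, dependent, dependency, independent):
        # Edges run from the affected (dependent) entity to the causing one.
        assert dependency in DEPENDENCIES
        self.edges.append((dependent, dependency, independent))

g = ProvenanceGraph()
g.add_node("Research Paper", "artifact")
g.add_node("Paper Documentation", "process")
g.add_node("Author(s)", "agent")
g.add_edge("Research Paper", "wasGeneratedBy", "Paper Documentation")
g.add_edge("Paper Documentation", "wasControlledBy", "Author(s)")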

4. Process of scientific publication

In this section, we expound on the process of scientific publication. This process provides a platform for potential scientists and researchers to publish and share their novel contributions to a given research domain, be it any type of research output, e.g., a methodology, framework, technology, algorithm, software etc. Moreover, it is not necessary that the research output always be successfully validated, e.g., researchers can publish results showing that their proposed technology, for a given domain, could not be validated, and is hence infeasible for this domain. The process of scientific publication is executed in the context of a given research venture, i.e., a workshop, conference or journal, and follows a broadly similar workflow, which we describe below.

This process starts when a particular organization announces a Call For Papers (CFP), i.e., a specification of almost all the research activities that have to be executed by the different participants associated with a scientific publication. These participants can be generally classified as authors, reviewers, and the program chair. The program chair is the primary person (head) within the organization, who is responsible for managing all activities related to the process of scientific publication. The CFP primarily provides information about the scope of the research venture, e.g., the particular scientific domain of the research venture, and the instructions required for authors to write and submit their publications, e.g., the length, formatting style, and method of submission of the publication, along with the expected date at which the authors will receive the responses for their submitted publication.4 These responses are either "accept" or "reject", i.e., the publication has been accepted, or rejected, by the reviewers, respectively. The reviewers are part of a program committee, which is headed by the program chair, and are supposed to follow a specific format for submitting their decisions, e.g., through criteria specified within a review form. Typically, some time elapses between the submission of papers and the finalization of the reviewers' decisions, e.g., 2 months, 10 months etc. It is the responsibility of the organizers to publish the accepted papers, typically in the proceedings of standard digital libraries, e.g., the libraries of IEEE,5 ACM,6 Springer7 etc. This publishing process allows researchers to query the state-of-the-art for their domain of interest, correctly determine the contribution of their own publication, and collaborate with other researchers in their domain. It also brings about a certain level of sophistication, i.e., researchers understand that they are required to demonstrate their understanding of the state-of-the-art, clearly present and describe their novel contributions (research outputs), and most importantly, validate these contributions through scientific experiments.

Fig. 1 illustrates the process of scientific publication outlined above. The process is initiated with the announcement of the CFP. The authors then write and submit their publication. If the submitted publication is within the scope of the CFP, it is passed on to the reviewers; otherwise, it is rejected. The reviewers submit their "accept" or "reject" decisions to the program chair, who moderates these decisions before conveying them to the authors. The authors can be asked to re-submit their accepted papers with either major, or minor, revisions. In this scenario, the reviewers can be asked to review the re-submissions.
If the re-submissions are satisfactory, then the authors are asked to submit a camera-ready version of their publication, i.e., one that conforms to the standard formatting guidelines of the publishing organization (IEEE, ACM etc.), which could be more specific than (or different from) the format specified in the CFP.

4. For an example of a CFP, see http://www.ht2009.org/cfp.php
5. http://www.ieee.org/publications_standards/publications/authors/publish_benefits.html
6. http://www.acm.org/publications
7. http://www.springer.com/?SGWID=9-102-0-0-0

Fig. 1. The process of scientific publication.
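Read as a state machine, the workflow of Fig. 1 can be sketched as follows (our illustrative encoding; the state names are hypothetical, not part of PADS):

# Our illustrative encoding of the Fig. 1 workflow as state transitions.
WORKFLOW = {
    "cfp_announced":           ["paper_submitted"],
    "paper_submitted":         ["under_review", "rejected_out_of_scope"],
    "under_review":            ["accepted", "accepted_with_revisions", "rejected"],
    "accepted_with_revisions": ["under_review"],        # re-submissions are re-reviewed
    "accepted":                ["camera_ready_submitted"],
    "camera_ready_submitted":  ["published"],
}

def next_states(state):
    # Returns the moves the process of scientific publication allows from `state`.
    return WORKFLOW.get(state, [])

assert "published" in next_states("camera_ready_submitted")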

When the camera-ready version is received, the publication gets published in one or more digital libraries. We will now describe how our proposed three-tier hierarchy can provide useful provenance information related to the aforementioned process.

5. System architecture of the provenance system

The system architecture for our proposed hierarchy is shown in Fig. 2. It consists of five primary modules: 1) Provenance Tracking

Layer, 2) Provenance Hierarchy, 3) Profiling Layer, 4) Provenance Data Store, and 5) SPARQL Query Tool. We describe them in the following sections.

5.1. Provenance tracking layer

The provenance tracking layer records, or tracks, all the provenance data related to our provenance hierarchy. To this end, we implement a tracking agent in order to acquire provenance data for each layer of our hierarchy, i.e., the CFP tracker for the CFP layer, the Document tracker for the Document layer, and the Content tracker for the Content layer. These agents primarily operate in the background of PADS. In order to

Fig. 2. Proposed provenance system for scientific publication; PADS = provenance-aware publication documentation system, PDS = provenance data store.

support these agents, PADS provides word-processing features for documenting a publication, along with a set of features for tracking diverse provenance elements related to our hierarchy, e.g., times, dates, titles, names, contents etc. (the actual provenance data for our hierarchy is detailed in Section 5.2). To this end, PADS provides web-based forms, which allow the tracking agents to track their respective provenance information from the documented paper. Currently, these forms can be populated both manually and automatically. Specifically, the forms related to the CFP tracker are populated manually, while the ones related to the Document and Content trackers are populated automatically. The manual insertion can be carried out by two types of users: 1) one or more authors, or 2) a provenance engineer, who can be either one of the research venture organizers, or some person trained specifically for the entry of provenance data. PADS provides a separate login ID (account) for each type of user. The CFP tracker queries both the authors and the provenance engineer for the information that each one can provide. For instance, Fig. 3 shows a part of the form related to the CFP tracker, which queries for basic CFP provenance data. The basic data for the research venture includes, for instance, its name, date, location, URL etc. When the authors start documenting a

paper, one of them can populate all the fields shown in Fig. 3 except the last one (number of accepted papers), which can be filled in by the provenance engineer later on (when this information is available). We note that PADS also allows different authors to work simultaneously with each other on the same paper. This is basically done by employing version control techniques,8 e.g., SVN.9 Moreover, Figs. 4 and 5 show parts of other forms being tracked by the CFP tracker, which are related to provenance data for the reviewers and authors, respectively. A glance at these forms shows that the reviewer form has to be completely filled in by the provenance engineer, while the author form can be completely filled in separately by each author. Also, Fig. 6 shows part of the form being tracked by the Document tracker. Here, all information is tracked and filled in by the Document tracker automatically, from the text of the documented paper. For instance, the number of times that the author revised a given section is tracked. Specifically, PADS is able to recognize the particular section on which the author is currently working.

8. The details of this functionality are outside the scope of this paper. We intend to provide them in a future journal paper.
9. http://tortoisesvn.net/

Fig. 3. CFP tracker form for basic CFP provenance.
Fig. 5. CFP tracker form for author provenance.

This is done by maintaining a list of the line numbers at which the author has documented headings, or sections. We simply detect the line number(s) (documentation area) at which the author is typing, and determine the section heading which occurs immediately before this number(s). This section is the one on which the author is working. Then,

the tracking of revision time proceeds as follows: suppose that some author worked on Section A of her document on her first login to PADS. Then, a counter, depicting the number of revisions of Section A, would be set to 1. If the author works on Section A again on her second login, then this counter would be set to 2, and so on. Besides this, we also track the revision time for each section. Specifically, as soon as we detect typing activity within a section, we start a stop-watch, which records the documentation time until activity is detected in another section.10 Besides this, information about the title, the number and titles of sections, the documented keywords etc., can all be easily gleaned by the Document tracker from the documented paper. Finally, Fig. 7 shows part of the form that is tracked and filled in by the Content tracker (all the tracked contents are listed in Section 5.2). In order to detect the different contents that have been documented in different sections, the Content tracker applies a limited amount of natural language processing techniques, i.e., using computer algorithms to make sense of written text [22]. It also pre-processes the acquired data before converting it into provenance information, e.g., by removing any missing or null values, removing inconsistent or incomplete data, normalizing and/or discretizing the data etc. This is necessary in order to ensure the data quality of the provenance information (for a detailed discussion on pre-processing, see [23]).11
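The following sketch illustrates the section-detection, revision-counting and timing logic just described (our reconstruction only; the class and method names are hypothetical, not the actual PADS implementation):

import time

class DocumentTracker:
    def __init__(self):
        self.heading_lines = {}    # section title -> line number of its heading
        self.revisions = {}        # section title -> number of login sessions with edits
        self.revision_time = {}    # section title -> accumulated seconds
        self._active = None        # section currently being edited
        self._started = None

    def register_heading(self, title, line_no):
        self.heading_lines[title] = line_no

    def current_section(self, cursor_line):
        # The section is the one whose heading occurs immediately before the cursor.
        before = [(ln, t) for t, ln in self.heading_lines.items() if ln <= cursor_line]
        return max(before)[1] if before else None

    def on_typing(self, cursor_line):
        section = self.current_section(cursor_line)
        if section != self._active:
            now = time.time()
            if self._active is not None:
                # Stop the "stop-watch" for the previously active section.
                self.revision_time[self._active] = (
                    self.revision_time.get(self._active, 0.0) + now - self._started)
            self._active, self._started = section, now

    def on_login_edit(self, section):
        # One revision is counted per login session in which the section was edited.
        self.revisions[section] = self.revisions.get(section, 0) + 1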

10. In the future, we intend to improve this method of recording time.
11. Details regarding the use of text pre-processing and natural language processing in PADS are outside the scope of this paper. We intend to provide them in a future journal paper.

Fig. 4. CFP tracker form for reviewer provenance.

5.2. Provenance hierarchy

In this section, we describe our three-layered provenance hierarchy for scientific publication. As outlined in Section 1, we construct and depict this hierarchy according to the specifications of the Open Provenance Model version 1.1 [12]. Our hierarchy consists of three layers, i.e., the CFP layer, the Document layer, and the Content layer. We discuss them individually below.

5.2.1. CFP layer

The provenance graph for the CFP layer is shown in Fig. 8. In this graph, we have not specifically illustrated the timestamp information of the respective CFP-related activities. This is because OPM version 1.1 does not support an appropriate depiction of timestamps [3]. Hence, we will consider timestamp information only as part of the provenance data. In the following, we initially explain the role of the four agents shown in Fig. 8:

• Author(s): this represents the (one or more) author(s) of the publication,
• Program Chair: this represents the head of the organization/institution that is managing the research venture,
• Program Committee: this represents the selected group of reviewers for a given research venture, and
• Organizing Committee: this represents the selected group of people who execute diverse types of tasks related to the research venture, e.g., deciding the location of the venture, selecting the academic dignitaries for making keynote speeches, organizing the presentation schedule of the author(s) of accepted publications etc.

Now, we explain the role of the four artifacts shown in Fig. 8:

• Research Paper: this represents the first draft of the publication as documented by the author(s), which could be either accepted or rejected,
• Camera Ready Paper: this represents the first draft of an accepted publication (i.e., one conforming to standard formatting guidelines),
• Published Paper: this represents the second (and final) draft of an accepted publication, i.e., the one actually printed in the proceedings of some digital library, and
• Publisher: this represents the name of the publishing organization, e.g., IEEE, ACM etc.

Fig. 6. Document tracker form for document (paper) provenance.

Finally, we explain the role of the eight processes shown in Fig. 8:

• Announcement and Propagation of CFP: this represents the process of designing and announcing the CFP, e.g., on a website, and then propagating it to various mailing lists, research-based bulletin boards, wikis etc.,

Fig. 7. Content tracker form for content provenance.

Fig. 8. The provenance graph of CFP layer.

• Paper Documentation: this represents the process of documenting, or writing, the publication according to a pre-defined format (as specified in the CFP),
• Paper Submission: this represents the process of submitting the paper to a given research venture, e.g., by using some submission portal like EasyChair,
• Paper Review: this represents the process related to the logistics of reviewing a publication, e.g., determining which publication should be assigned to which reviewer, the timeframe given to the reviewers to submit their reviews etc.,
• Paper Acceptance/Rejection: this represents the process related to the criteria of the reviewers, through which they decide to accept, or reject, a publication, e.g., based on their background knowledge, experience etc.,
• Publishing: this represents the process through which an accepted publication gets published in the proceedings of digital libraries, e.g., the timeframe and payment schedules for publishing etc.,

• Author(s) Registration: this represents the process through which authors pay the research venture, in order to get their accepted papers published, e.g., online payment through credit card, and
• Publication Event: this represents the process of conducting a research venture (carried out by the organizing committee).

Having described the agents, artifacts and processes, we will now enlist and explain how they are linked to each other in Fig. 8:

• The program chair controls the announcement and propagation of the CFP, by identifying and finalizing its content, and managing its distribution to diverse individual contacts and emailing lists,
• This announcement triggers the authors to document their publications, according to the format specified in the CFP,
• This documentation process, controlled entirely by the one or more authors, generates a research paper (publication) as output,
• Only when this paper is complete can the author(s) submit it (control its submission), i.e., send a copy of this paper to the given

research venture, typically by using email, or some online submission portal,
• The program chair also controls this submission process, e.g., by verifying the submitted paper, and resolving any functionality issues related to the portal,
• The submission process triggers the process of reviewing, in which completed research papers are reviewed by one or more member(s) of the program committee,
• The review process is controlled by both the program committee and the program chair (the program committee reviews the paper according to the instructions specified by the program chair),
• The review process triggers the acceptance (or rejection) of a paper, in which the reviewers explain and justify their reviews to the authors,
• The program chair controls the logistics related to these decisions, e.g., verifying them, communicating them to the authors etc.,
• The decisions of acceptance are required to generate the camera-ready paper, which is derived from the research paper (according to the required format of the publishing organization),
• The acceptances also trigger the process of publishing, which the publisher employs in order to publish the accepted papers,
• These published papers are derived from their camera-ready counterparts, according to the requirements of the publisher, and are the sole outputs of the publishing process,
• In order to present these papers at the research venture (event), the author(s) can choose to register (control their registration) for this event, and
• This registration process triggers the organizing committee to manage this event.

We will now enlist the provenance information that can be provided by the graph in Fig. 8:

• What was the name of the research venture?
• What are the dates at which this venture was held?
• What was the location at which this venture was held?
• What was the URL of the website related to the given venture?
• What was (were) the name(s) of the organization(s) who arranged this venture?
• What was the name and affiliation of the program chair?
• What were the names and affiliations of each member of the program and organizing committees?
• At which date was the CFP announced?
• To how many email recipients was the CFP dispatched?
• What was the deadline (date) for paper submission?
• Was the paper submission deadline extended? If yes, what was the new deadline (date)?
• How many late submissions were entertained for review? How many late submissions were not entertained?
• Was a submission portal set up to acquire submissions? If yes, what was its name?
• Were the submissions acquired through email? If yes, what was this email address?
• In total, how many paper submissions were received?
• What was the time period (in days) of the review process?
• On average, how many papers were assigned to each reviewer?
• What was the acceptance rate (how many papers were accepted/rejected)?
• What was the remuneration amount given to each reviewer, per review?
• What was the total number of reviews, provided by all the reviewers?
• How many reviews were not accepted (controlled) by the program chair?
• In total, how many camera-ready submissions were received?
• Were all camera-ready submissions successfully published?
• What was the name and address of the publisher?
• What was the date of publication of the proceedings?
• How many authors registered for the conference?
• How many authors attended the conference?

• What are the names and affiliations of the keynote speakers at the given research venture?
• For a submitted research paper with a given title:
  ○ What was (were) the name(s) of the author(s) of the paper, their affiliation(s) and their email address(es)?
  ○ At which date was the paper submitted?
  ○ Did the length of the paper (in number of pages) comply with the required length, as specified in the CFP?
  ○ At what date was the author informed about the paper's decision?
  ○ How many reviews were given for the paper?
  ○ Was the paper accepted?
  ○ If yes, did one of the authors register for the research venture?
  ○ If yes, did one or more authors attend the research venture?

We will not expound on these queries, as we believe they are self-explanatory. Also, users can pose these queries to the SPARQL Query Tool (refer to Fig. 2), which is described in Section 5.4. The same is true for the queries related to the Document and Content layers. These queries provide comprehensive provenance information related to the CFP layer. We present a case study regarding some of these queries in Section 6.
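For instance, the acceptance-rate query from the list above could be posed to the SPARQL Query Tool roughly as follows (a sketch against the hypothetical pads: namespace used earlier; the actual PADS ontology terms and data store export may differ):

# Sketch of answering "What was the acceptance rate?" over the provenance
# store; the pads: terms and file name are hypothetical stand-ins.
from rdflib import Graph

g = Graph()
g.parse("pads_provenance.rdf")   # hypothetical export of the provenance data store

counts = g.query("""
    PREFIX pads: <http://example.org/pads#>
    SELECT ?decision (COUNT(?p) AS ?n) WHERE {
        ?p a pads:ResearchPaper ;
           pads:decision ?decision .
    } GROUP BY ?decision""")

totals = {str(decision): int(n) for decision, n in counts}
accepted = totals.get("accept", 0)
rate = accepted / max(sum(totals.values()), 1)
print(f"acceptance rate: {rate:.0%}")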
5.2.2. Document layer

We recall from Section 1 that the Document layer provides more detailed (specific) provenance data related to the research paper artifact shown in Fig. 8. For this, the relevant provenance data of the research paper from the CFP layer is passed to the Document layer (labeled "Research Paper Data" in Fig. 2), e.g., the title of the paper, the authors etc. Then, the Document layer provides provenance data for the different sections documented in this research paper. In our work, we have identified eight such sections, i.e., Abstract, Introduction, Related Work, Paper Body, Experimental Methodology, Results and Discussion, Conclusions and Future Work, and References and Keywords. In our opinion, these sections represent the generic structure of publications that are associated with scientific experimentation, i.e., in which a validation or proof has been documented. In the provenance graph shown in Fig. 9, we have illustrated these sections as artifacts, and we describe them as follows:

• Abstract: this represents the label of the first section in a paper, which presents a very brief, yet complete, summary of the paper's content,
• Introduction: this represents a typical label of the second section in a paper (it can also be labeled as "Background", "Background and Motivation" etc.); it gives a more detailed summary of the paper's content (as compared to the Abstract),
• Related Work: this represents the label for the third section in a paper (although it may be documented elsewhere in the paper as well), which shows how the novel work presented in the paper, for a given domain, is different from and/or better than the previous work in this domain,
• Paper Body: this represents a generic label (different for each paper) for the fourth section in a paper, in which the authors describe their novel contributions in detail; we note that the paper body can span more than one section,
• Experimental Methodology: this represents a typical label for the section in which the authors describe their selected experimental setting, in order to validate their contribution (it can also be labeled as "Experimental Setup", "Experimental Configuration", "Test Bed" etc.),
• Results and Discussion: this represents a typical label for the section in which the authors describe their results, and (non-)validate their idea,
• Conclusions and Future Work: this represents a typical label for the concluding section, in which the authors summarize the contents of the paper, and document future directions of work (although these two data are different, they are typically documented in the same section),

Fig. 9. The first provenance graph of the Document layer.

• References and Keywords: this represents a combined label for two different sections, i.e., "References" documents the bibliographical references that are referred to by the authors in order to explain their work, and "Keywords" documents a set of words that summarize the complete content of the paper. We have considered these sections together because they are related to the other sections in exactly the same way in the provenance graph (described later on).

We note that this classification of sections might not be exactly applicable to each research paper. For instance, in some papers, the authors propose a certain framework or algorithm, but do not conduct any experiments to validate its usability. So, no content will be available for the Experimental Methodology and Results and Discussion sections. Also, some papers might not mention their future directions of work, e.g., due to space constraints in paper writing. Due to this variety in documentation, we have considered all possible sections, and the Document layer will provide provenance information related only to the documented sections. Along with the eight artifacts in Fig. 9, we have also identified one process, and several agents, that are associated with these artifacts. We illustrate them in Fig. 10,12 and describe them as follows:

• The agents "Author 1" through "Author N" represent the one or more authors of the publication, and

12. We show them in another figure for the sake of clarity.

• The process "Collaborative Effort of Authors" represents the process through which the authors have collaborated with each other, in order to document the publication.

We will now describe how the entities shown in Figs. 9 and 10 are related to each other:

• The Introduction is derived from the Abstract, and provides more detail about the contents of the Abstract (described in the next section),
• A brief summary of the Related Work is derived from the Abstract, while a more detailed version is derived from the Introduction,
• The contents of the Related Work should be mentioned in both the Abstract and Introduction,
• A brief summary of the Paper Body is derived from the Abstract, while a more detailed version is derived from the Introduction,
• The contents of the Paper Body may, or may not, be mentioned in the Abstract, or Introduction; if they are not mentioned, then no derivations from these sections are possible,
• A brief summary of the Experimental Methodology is derived from the Abstract, while a more detailed version is derived from the Introduction,
• The contents of the Experimental Methodology may, or may not, be mentioned in the Abstract, or Introduction; if they are not mentioned, then no derivations from these sections are possible,
• The Experimental Methodology can also be derived from the Paper Body, because the characteristics of the novel contribution can influence the selection of the experimental setting, e.g., an adaptive technology for a web-based system can be best evaluated through an evaluation with real users in a controlled laboratory setting,

Fig. 10. The second provenance graph of the Document layer.

• A brief summary of Results and Discussion is derived from the Abstract, while a more detailed version is derived from the Introduction,
• The contents of Results and Discussion should be mentioned in both the Abstract and Introduction,
• The Results and Discussion is derived from the Experimental Methodology, as the results are always influenced by the selected experimental setting, e.g., the subjective ratings provided by real users to an online questionnaire, while specifying their satisfaction with the adaptive web-based technology (mentioned above),
• A brief summary of Conclusions and Future Work is derived from the Abstract, while a more detailed version is derived from the Introduction,
• The contents of Conclusions and Future Work may, or may not, be mentioned in the Abstract, or Introduction; if they are not mentioned, then no derivations from these sections are possible,
• The Conclusions and Future Work are also derived from the Results and Discussion, as the results are always mentioned in the concluding remarks, and they assist in defining the future work as well, e.g., if the users are not satisfied with the aforementioned adaptive technology, then the future work could include a re-design of this technology,
• The References can be derived from all of the following sections: Introduction, Related Work, Paper Body, Experimental Methodology, Results and Discussion, and Conclusions and Future Work,
• The References can also be derived from the Abstract (as shown in Fig. 9); however, quoting references in the Abstract is not a standard publication documentation practice,
• The Keywords are derived from all the other sections, as they represent words that can be related to every other section,
• Each of the authors of a publication controls the process of their collaboration,

• Each section (Abstract, Introduction etc.) can be generated by this collaborative process, e.g., if a publication has 3 authors, the Paper Body can be documented by one author, while the Introduction can be documented by two authors etc.

We note that the aforementioned relationships present our own understanding of a typical publication documentation process, and we have verified them through the extensive knowledge and background of a group of researchers.13 A sketch summarizing these derivation links follows.
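The sketch below (our encoding of the Fig. 9 relationships, not PADS code) summarizes the wasDerivedFrom links as an adjacency structure, from which the transitive sources of any section can be computed:

# Our illustrative encoding of the Fig. 9 wasDerivedFrom edges:
# section -> sections it may be derived from.
DERIVED_FROM = {
    "Introduction":                ["Abstract"],
    "Related Work":                ["Abstract", "Introduction"],
    "Paper Body":                  ["Abstract", "Introduction"],   # derivations optional
    "Experimental Methodology":    ["Abstract", "Introduction", "Paper Body"],
    "Results and Discussion":      ["Abstract", "Introduction", "Experimental Methodology"],
    "Conclusions and Future Work": ["Abstract", "Introduction", "Results and Discussion"],
    "References": ["Abstract", "Introduction", "Related Work", "Paper Body",
                   "Experimental Methodology", "Results and Discussion",
                   "Conclusions and Future Work"],
    "Keywords":   ["Abstract", "Introduction", "Related Work", "Paper Body",
                   "Experimental Methodology", "Results and Discussion",
                   "Conclusions and Future Work"],
}

def sources(section):
    # All sections a given section may be (transitively) derived from.
    seen, stack = set(), list(DERIVED_FROM.get(section, []))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(DERIVED_FROM.get(s, []))
    return seen

print(sources("Results and Discussion"))  # includes Paper Body, via the Methodology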

We will now enlist the provenance information (in the form of queries) that can be provided by the graphs in Figs. 9 and 10:

• Which author(s) worked on a given section (Abstract, Introduction etc.)?
• Is a given section documented?
• What is the total documentation time for a given section?
• How many times was a given section revised?
• What is the total revision time for a given section?
• Which sections were documented by one author?
• Which sections were documented by more than one author?
• Which section was revised the least (most) number of times?
• Which section took the smallest (largest) amount of time for documentation?
• How many sections are related to the Paper Body?
• How many references were quoted by the author(s)?
• How many keywords were specified by the author(s)?

13. We selected 15 researchers from the Center for Research in Ubiquitous Computing (CRUC), the research group of the authors. More details about the group can be found at http://cruc.khi.nu.edu.pk/.


Fig. 11. Provenance graph of Content layer.

5.2.3. Content layer

We recall from Section 1 that the Content layer provides more specific provenance data related to the research paper artifact shown in Fig. 8. For this, the relevant provenance data of the research paper from the Document layer is passed to the Content layer (labeled "Research Paper Data" in Fig. 2), e.g., the title of the paper, the documented sections, the total number of sections etc. Then, the Content layer provides provenance data related to the different research contents that are associated with this research paper, i.e., the different concepts that characterize the research work documented in this paper. We have identified six different research contents, i.e., Limitations of State Of The Art (SOTA), Proposal of Idea, Description of Idea, Comparison with SOTA, Methodology Logistics, and Validation/Proof of Idea.14 We illustrate them as artifacts in the provenance graph shown in Fig. 11, and describe them as follows:

14 Similar to the sections of a publication, we have validated these contents through members of our CRUC research group.

1. Limitations of SOTA: This represents the limitation(s), or drawback(s), of the state-of-the-art work concerning the domain of the publication, e.g., "no state-of-the-art E-Commerce system uses only neural networks for the problem of recommending adapted products to the user",
2. Proposal of Idea: This represents a brief summary of the idea associated with the novel contribution presented in the publication, e.g., "an E-Commerce portal that employs neural network classifiers in order to adapt the interaction for online users by recommending them their desired products",
3. Description of Idea: This represents a complete description of the idea associated with the novel contribution, e.g., the description of the design diagram of the adapted E-Commerce system,
4. Comparison with SOTA: This represents a comparison of the novel idea with its related domain work (as specified by the SOTA), e.g., demonstrating how the adapted E-Commerce system is different from, or better than, each of the SOTA E-Commerce systems,

5. Methodology Logistics: This represents logistical information related to the experimental methodology, e.g., in the evaluation of the adapted E-Commerce system, specifying the duration of the experiment, the type of experiment (e.g., an online evaluation with real users), and the different phases of the methodology (e.g., recruitment of online users, managing the experiment, analysis of results etc.), and
6. Validation/Proof of Idea: This represents logistical information related to the results of the experiments (or to the proof of some algorithm), e.g., showing how the results (obtained from the evaluation of the adapted E-Commerce system) validate the novel idea.

Having described the artifacts, we will now describe how they are related to each other in Fig. 11 (a query sketch over these derivation links is given below):

• The Proposal of Idea is derived from the Limitations of SOTA, because a novel research idea of a given domain almost always addresses the drawbacks of the SOTA of this domain,
• The Description of Idea is derived from the Proposal of Idea, as it describes the idea's proposal in more detail,
• The Comparison with SOTA is derived from the Limitations of SOTA, as the comparison is based on the related works identified in the Limitations of SOTA,
• The Comparison with SOTA is also derived from both the Proposal of Idea and the Description of Idea, because summaries and details of the novel idea are required in order to compare them with the related work,
• The Methodology Logistics is derived from the Description of Idea, because, generally speaking, describing the idea assists in determining the method of its validation,
• The Methodology Logistics is also derived from the Comparison with SOTA, because analyzing how the related work has been evaluated could assist in determining the novel idea's method of validation, and
• The Validation/Proof of Idea is derived from the Methodology Logistics, because executing the methodology leads to results, and hence, to the (non-)validation of some idea.

We recall from Section 1 that the Content layer provides more specific information, as compared to the Document layer. In this context, we will now illustrate how different sections of a publication are associated with different types of research contents. We model these associations through provenance graphs. For the sake of clarity, we illustrate separate provenance graphs for the following sections: Abstract, Introduction, Paper Body, Experimental Methodology, Results and Discussion, and Conclusions and Future Work.
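Once these derivation links are stored in the PDS (Section 5.4), they can also be traversed mechanically. The sketch below retrieves every artifact from which the Validation/Proof of Idea is directly or transitively derived; it assumes that the derivations are recorded as OPM-style wasDerivedFrom links, and the prefixes and IRIs are illustrative assumptions:

```sparql
# Sketch: trace all artifacts from which "Validation/Proof of Idea"
# is directly or transitively derived, assuming OPM-style
# wasDerivedFrom links. Prefixes and IRIs are illustrative assumptions.
PREFIX opm:  <http://openprovenance.org/model/opmo#>
PREFIX pads: <http://example.org/pads#>

SELECT ?ancestor
WHERE {
  pads:ValidationProofOfIdea opm:wasDerivedFrom+ ?ancestor .
}
```

Following the chains in Fig. 11, such a query would return the Methodology Logistics, the Description of Idea, the Comparison with SOTA, the Proposal of Idea, and the Limitations of SOTA.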


Fig. 12. Provenance graph for Abstract.

Moreover, in our opinion, two types of research contents could be associated with a given section: 1) a primary research content, i.e., one whose presence within this section is mandatory, and 2) a secondary research content, i.e., one whose presence within this section is not mandatory. In the provenance graphs, we annotate the derivations for the primary and secondary research contents with the labels "primary" and "secondary" respectively.

The provenance graph for the Abstract is shown in Fig. 12. In the Abstract, the authors primarily introduce a given domain, define its limitations, propose their novel idea to cater for these limitations, and mention their experimental results. Hence, in Fig. 12, we derive the Abstract from three primary research contents, i.e., Limitations of SOTA, Proposal of Idea, and Validation/Proof of Idea (annotated with the label "primary"). Occasionally, some authors might also compare their idea with the SOTA within the Abstract, or mention minute details related to the experimental methodology. Hence, in Fig. 12, we also derive the Abstract from the research contents Comparison with SOTA and Methodology Logistics. As these derivations are occasional, we have labeled them as "secondary" in Fig. 12. We note that, as the Abstract presents an extremely brief summary of the paper, the length of each of its research contents is kept very small, e.g., 1–2 sentences.

The provenance graph for the Introduction is shown in Fig. 13. In fact, the Introduction is a brief summary of the whole paper. Hence, in our point of view, it should include all the research contents, i.e., the limitations of the given domain, the definition of the novel idea, and a brief comparison with the SOTA, along with brief descriptions of the idea, the experimental methodology and the results. Thus, in Fig. 13, we have derived the Introduction from all of our six research contents, and, as mentioned above, we believe that all these derivations are mandatory (labeled as "primary").

The provenance graph for the Paper Body is shown in Fig. 14. As the Paper Body builds upon the proposal of the novel idea, and describes it in detail, we derive it from the research contents Proposal of Idea and Description of Idea, with both derivations being primary. Moreover, in some papers, the Paper Body could also refer to the SOTA, in order to explain exactly how the novel idea is an improvement over the limitations of the SOTA. Hence, we also derive the Paper Body from the Limitations of SOTA, and label this derivation as "secondary".

The provenance graph for the Experimental Methodology is shown in Fig. 15. This section basically describes all the issues related to the selection of the experimental setting, and the execution of the experiments. Hence, we primarily derive it from the research content Methodology Logistics. Moreover, we believe that the description of the idea is important in determining the experimental setting, e.g., a novel architecture for an E-Commerce system can be best validated through an online evaluation with real users. Hence, we also derive the Experimental Methodology from the research content Description of Idea. Finally, in some cases, it is possible that a publication's related work could also assist in determining the experimental setup, e.g., by surveying the experimental approaches of E-Commerce technologies, we can decide which of the approaches is most suitable for the validation of our E-Commerce architecture. Hence, we derive the Experimental Methodology from the research content Comparison with SOTA (labeled as "secondary").

The provenance graph for the Results and Discussion is shown in Fig. 16. The results discussed in this section either validate or invalidate the idea. Hence, we derive this section from the research content Validation/Proof of Idea (label "primary"). Also, this section describes the results according to the experimental setting, e.g., if the experiment is divided into two phases, then the results for both phases should be discussed. Thus, we also derive this section from Methodology Logistics (label "primary"). Finally, this section can also (occasionally) demonstrate how the results address the limitations of the SOTA, and hence, improve the SOTA. Hence, we derive this section from the research content Limitations of SOTA (label "secondary").

Fig. 13. Provenance graph for Introduction.


Fig. 14. Provenance graph of Paper Body.

Fig. 15. Provenance graph of Experimental Methodology.

Having illustrated the provenance graphs of the Content layer, we will now describe the provenance information that can be provided by this layer. In our opinion, this information can be extremely useful for the reviewers during the Paper Review process. For instance, an important review criterion is whether a given section contains all the required research contents. For instance, the Abstract and Introduction are incomplete if they don't summarize the results of the paper, and the Related Work is incomplete if it doesn't thoroughly compare the novel idea with the SOTA. The possible provenance queries related to this requirement are as follows:

• Does a given section contain all of its primary research contents?
• Does a given section contain a given primary research content?
• Does a given section contain one or more secondary research contents?
• Does a given section contain a given secondary research content?
• Which research contents are (not) present in a given section?

Another review criterion is the coherency between the sections, which can be verified through the derivations in our provenance graphs. So, for instance, if the idea described in the Paper Body is somewhat different (not derived) from the one proposed in the Introduction, then this indicates incoherency. A similar situation can arise if the Results and Discussion is not derived from the Experimental Methodology, e.g., if the experiment is conducted in two phases, then the results for both phases should be presented. Besides this, the following provenance queries (related to the Content layer) can also assist the reviewers in their work15:

• Does the proposed idea appear to be novel?
• Is the novel idea described clearly (unambiguously) in the paper body?
• Does the description of the novel idea address the limitations of the SOTA?
• Is the comparison with the related work enough, or should the authors have compared more related papers?
• Is the selection of the experimental methodology justified?
• Do the results present enough proof in order to validate the novel idea?
• Do the authors mention the limitations of the novel idea?
• What are the primary issues regarding the research contents?
• Is the use of the English language appropriate?
• Should the paper be accepted or rejected?

15 In fact, the reviewers can exploit these queries for their work, before the provenance can be documented.

Along with documenting provenance information, these queries will allow the reviewers to determine precisely the content of each section, and hence, acquire an overall rating for the paper's content. In other words, the Content layer documents provenance information that can assist the reviewers in their tasks. So, this information can help us to understand, for instance, why a given publication was accepted (or rejected), the criteria which formed the basis of this decision etc.

It is important to note that, although we have proposed our hierarchy based on our research background and experience, we have not standardized it in any way. Specifically, other researchers or research ventures have the flexibility to modify our hierarchy by adding (or deleting) entities, in order to suit their own needs. Also, all the provenance data enlisted in this section is tracked by our three tracking agents (Section 5.1). For the sake of clarity, in Figs. 3–7, we have shown only a subset of this data.

5.3. Profiling layer

The Profiling layer (Fig. 2) can be considered as a by-product of our provenance hierarchy. It consists of three types of profiles, or provenance models, which can be generated by our hierarchy, i.e., the research venture profile, the author profile, and the reviewer profile. We describe them in the following sections.

5.3.1. Research venture profile

The provenance data generated by the CFP layer (refer to Section 5.2.1) assists in building a research venture profile, i.e., a provenance model for a given research venture. A possible provenance graph for the research venture profile is shown in Fig. 17. Here, the artifact Research Venture Profile derives the following artifacts (information) about a given research venture:

• Name and contact information,
• The date at which the research venture is going to be held,
• The venue and URL of the research venture,
• Statistics related to the CFP, e.g., the date on which the CFP was announced, the number of recipients to which the CFP was emailed etc.,
• A list of the organizations (or institutions) sponsoring the research venture,
• Relevant data related to the best paper awards, e.g., the amount of prize money and the gifts given to the winners,
• Relevant data related to any travel grants being offered by the research venture, e.g., the amount of grant, the number of participants who won the grant etc.,
• Data about the different research tracks, poster sessions, and workshops being held at the research venture, e.g., the names of the tracks, the number of participants of the poster sessions and workshops, the topic of each workshop etc.,
• Statistics related to the review process, i.e., the total time taken for this process, the number of papers assigned to each reviewer (on average) etc.,
• Statistics related to the submission of research papers, e.g., the total number of submissions, the submission rate at specified time intervals etc.,
• Data related to the keynote speakers and the program committee members, and
• The acceptance rate, and the number of registered participants.


Fig. 16. Provenance graph for Results and Discussion.

If we compare the profiles of different research ventures, we can acquire useful information, e.g.,

• The conferences which attract more participants (e.g., more than 500),
• The conferences in which a given person was a program committee member, a program chair, or a keynote speaker,
• The competitive conferences, e.g., those with an acceptance rate of less than 20%, and
• The conferences associated with a particular publisher etc.

In fact, publishers like ACM and IEEE do maintain logistical data related to their research ventures, e.g., the name and location of the conference, details about this location, the scope of the conference etc. However, this information doesn't assist in determining the value of a research venture, based on concrete provenance information like what we have presented in the CFP layer.

5.3.2. Reviewer profile

The provenance data from all the three layers of our hierarchy (Sections 5.2.1–5.2.3) assists in building a reviewer profile, i.e., a type of provenance model about a given reviewer. A possible provenance graph for the reviewer profile is shown in Fig. 18. Here, the artifact Reviewer Profile derives the following artifacts (information) about a given reviewer:

• Name, affiliation, designation and email,
• A list of research interests,
• The acceptance rate, i.e., the rate at which the reviewer accepts papers,
• The rejection rate, i.e., the rate at which the reviewer rejects papers,
• A list of research ventures at which the reviewer has previously reviewed papers,
• The time since which the reviewer has been reviewing, or the reviewing experience,
• The section(s) typically reviewed by the reviewer; these can be extracted from the reviewer's textual comments through natural language processing. They could provide reasons as to why the reviewer accepted/rejected a paper. However, this information might be incomplete, because the reviewer might not comment on every section she has reviewed, and
• The research contents (of one or more sections) verified by the reviewer during her review process; these can also be extracted through natural language processing.16
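The rates in this profile map directly onto the RWProfile datatype properties introduced in Section 5.4; for instance, listing the reviewers ordered by their personal acceptance rate might be sketched as follows (the pads: prefix and the exact IRIs are assumptions):

```sparql
# Sketch: list reviewers ordered by their personal acceptance rate.
# RWProfile and RWAccRate are the names used in Section 5.4;
# the pads: prefix is an illustrative assumption.
PREFIX pads: <http://example.org/pads#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?reviewer ?accRate
WHERE {
  ?reviewer rdf:type pads:RWProfile .
  ?reviewer pads:RWAccRate ?accRate .
}
ORDER BY DESC(?accRate)
```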

Fig. 17. Provenance graph for the research venture profile.


Fig. 18. Provenance graph for the reviewer profile.

If we compare the profiles of different reviewers, we can learn which sections and which research contents each deems important for review, their acceptance and rejection rates, and the types of issues identified per paper, e.g., typically, "the reviewer has issues regarding the selection and design of the experimental methodology", "the reviewer hardly comments on the related work" etc. This type of data regarding the reviewing behavior can assist in selecting reviewers for a given research venture.

5.3.3. Author profile

The provenance data from all the three layers of our hierarchy (Sections 5.2.1–5.2.3) assists in building an author profile, i.e., a type of provenance model about a given author. A possible provenance graph for the author profile is shown in Fig. 19. Here, the artifact Author Profile derives the following artifacts (information) about a given author:

• Name, affiliation, designation and email,
• A list of research interests,
• Acceptance data, i.e., data related to the author's papers which have been accepted and published, e.g., the titles and keywords of these papers, and the rate of acceptance of the author's submitted papers,
• Rejection data, i.e., data related to the author's papers which have been rejected, e.g., the titles and keywords of these papers, and the rate of rejection of the author's submitted papers,
• Statistics related to the documentation process, e.g., the total time taken by the author while documenting, and revising, the paper, the maximum number of revisions etc.,
• A list of research ventures at which the author has previously submitted papers,
• The time since which the author has been submitting research papers, or the authoring experience,
• The section(s) typically documented by the author, and
• The research contents (of one or more sections) typically documented by the author.

16 As we have previously mentioned, we do not present the technical details of our system in this paper. Thus, we will not expound on how we are going to use natural language processing to extract this reviewing information.
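A comparison across author profiles can be sketched in the same manner; for instance, the query below retrieves the authors who list a given research interest, e.g., in order to decide to whom a CFP should be dispatched (Section 6.5). ATProfile is the class from Section 5.4, while the ATInterest property and the pads: prefix are hypothetical names:

```sparql
# Sketch: find authors whose profile lists an interest in recommender
# systems. ATProfile is the class from Section 5.4; ATInterest and the
# pads: prefix are hypothetical names.
PREFIX pads: <http://example.org/pads#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?author
WHERE {
  ?author rdf:type pads:ATProfile .
  ?author pads:ATInterest ?interest .
  FILTER (CONTAINS(LCASE(STR(?interest)), "recommender"))
}
```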

This data conveys a structured view of a given author, which can assist in evaluating her worth in a given research community, e.g., by addressing queries such as "does the author submit to competitive conferences?", "what is the author's success rate (in terms of paper acceptance)?", "which authors share similar research interests?" etc.

Similar to the provenance data of our hierarchy, we have not standardized the aforementioned profiles. Other researchers can add (or delete) content from our profiles in order to adapt them to their research needs. In the next section, we will present an example of these three profiles in a case study.

5.4. Provenance data store and SPARQL Query Tool

The provenance data generated by PADS is stored and updated in a central provenance store called the provenance data store, or PDS (label "Update PDS" in Fig. 2). The PDS stores this data as ontological instances. Specifically, we use OWL17 to model an explicit ontology for our provenance data. We implement this ontology in the Protégé tool18 (version 3.3.1). Fig. 20 shows the six classes in our ontology, which are subclasses of the generic owl:Thing class. We describe them as follows:

1. CFPPData, i.e., the class representing all instances of the provenance data generated by the CFP layer,
2. DocumentPData, i.e., the class representing all instances of the provenance data generated by the Document layer,
3. ContentPData, i.e., the class representing all instances of the provenance data generated by the Content layer,
4. RVProfile, i.e., the class representing all the profiles of the different (instances of) research ventures,
5. RWProfile, i.e., the class representing all the profiles of the different (instances of) reviewers, and
6. ATProfile, i.e., the class representing all the profiles of the different (instances of) authors.

17 http://www.w3.org/TR/owl
18 http://protege.stanford.edu/


Fig. 19. Provenance graph for the author profile.

Moreover, Fig. 21 shows the following five "Object" properties, which represent the relationships, or links, between these classes:

1. usesCFPData, i.e., DocumentPData uses the provenance data of CFPPData,
2. usesDocData, i.e., ContentPData uses the provenance data of DocumentPData,
3. generatesRVProfile, i.e., for a given research venture, the provenance data from CFPPData is used to generate the RVProfile,
4. generatesRWProfile, i.e., for a given research venture, the provenance data from CFPPData, DocumentPData, and ContentPData is used to generate the RWProfile, and
5. generatesATProfile, i.e., for a given research venture, the provenance data from CFPPData, DocumentPData, and ContentPData is used to generate the ATProfile.

Finally, in Fig. 22, we show some of the "Datatype" properties belonging to each of our six classes,19 i.e., the members of the classes. We define them as follows:

• For the class CFPPData, AcceptRate depicts the acceptance rate of the research venture, TotReviews represents the total number of reviews submitted by the reviewers, and RVLocation represents the location of the research venture,
• For the class DocumentPData, RevTimeIntro represents the total time taken for revising the Introduction, while NumKeywords and NumReferences represent the number of keywords and references specified by the authors, respectively,
• For the class ContentPData, PrimContIntro depicts whether (or not) all the primary contents have been mentioned in the Introduction section, EngApp depicts whether (or not) the English language has been appropriately used, and NovelIdea depicts whether (or not) the idea presented in the paper is novel,
• For the class ATProfile, ATSince, ATEmail, and ATIntroDoc represent the time since which the author has been writing papers, the author's email, and whether (or not) the author documents the Introduction, respectively,
• For the class RWProfile, RWTime, RWAccRate, and RWAffil represent the time since which the reviewer has been reviewing papers, the reviewer's personal rate of accepting his/her reviewed papers, and the reviewer's affiliation, respectively, and
• For the class RVProfile, TravGrantAllot, TravGrantAmnt, and TravGrantName represent the number of grants allotted to the participants of the research venture, the amount of this grant, and its name, respectively.

For a given research venture, we create instances of our six classes and store them in the PDS. So, the PDS is a collection of ontological instances related to the tracked research ventures. Such a setting ensures semantic interoperability amongst the provenance data generated through different applications of our proposed provenance system. For instance, different provenance data of the CFP layer, or different reviewer profiles, can be easily compared with each other. We allow users to query our provenance data through the SPARQL language (labels "Access PDS" and "Query" in Fig. 2). Currently, we are employing the SPARQL Query Tool, which is built into Protégé. Later on, we plan to employ the Protégé API to build a GUI to facilitate querying for the users.
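To make this concrete, the following sketch shows how an instance of CFPPData, and the research venture profile it generates, might be asserted in the PDS through a SPARQL Update. The class, property, and link names are those of Figs. 20–22, and the instance name CFP_HT09 is the one retrieved in Section 6.4; the pads: prefix and the exact IRIs are illustrative assumptions:

```sparql
# Sketch: assert a CFPPData instance with some of its datatype
# properties, and link it to the research venture profile it generates.
# Names follow Figs. 20-22; the prefix and IRIs are assumptions.
PREFIX pads: <http://example.org/pads#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

INSERT DATA {
  pads:CFP_HT09  rdf:type                 pads:CFPPData ;
                 pads:AcceptRate          20 ;
                 pads:RVLocation          "Torino, Italy" ;
                 pads:generatesRVProfile  pads:RVProfile_HT09 .
  pads:RVProfile_HT09  rdf:type  pads:RVProfile .
}
```

In PADS itself, the instances are created through the Protégé editor; the update above is only meant to illustrate the triple structure of the PDS.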

6. Experiments with PADS

In this section, we conduct some basic experiments in order to test and validate the functionality of PADS. We document and track the provenance data related to three conferences (research ventures) within PADS. We list them as follows:

1. HT09,20 i.e., the twentieth ACM Conference on Hypertext and Hypermedia, which was held in Torino, Italy, in 2009,
2. ICICT11,21 i.e., the fourth International Conference on Information and Communication Technologies, which was held in Karachi, Pakistan, in 2011, and
3. ICIET10,22 i.e., the second International Conference on Information and Emerging Technologies, which was held in Karachi, Pakistan, in 2010.

All three conferences have already been held, and we have previously submitted papers at these conferences, which have also been accepted. For testing purposes, we decided to re-document these papers within PADS. The rationale is that, for these conferences and papers, we are currently knowledgeable about a majority of the data that needs to be tracked within PADS, e.g., the data related to the conference, the contents of each section etc. Also, it's not possible to track all of our provenance data for a paper that is currently being documented for an upcoming conference.

19 All the datatypes are not shown for the sake of brevity.
20 www.ht2009.org
21 http://icict.iba.edu.pk/
22 http://www.khi.nu.edu.pk/ieee/iciet2010/


Fig. 20. The six classes comprising our Ontology; PData = provenance data, AT = author, RV = research venture, RW = reviewer, CFP = Call For Papers.

We have documented one paper for HT09, labeled "Doc_HT09", two papers for ICICT11, and one paper for ICIET10. In Sections 6.1–6.3, we illustrate snapshots of some of the provenance data related to Doc_HT09, which was obtained by executing several SPARQL queries in Protégé. It has been grouped as data obtained from the CFP, Document and Content layers, along with the profile for HT09 and for one of the authors of this paper.

Also, the provenance data obtained is quite comprehensive, so we show only a subset of this data in this section. In Section 6.4, we will demonstrate the use of the SPARQL Query Tool within Protégé, and we will comment briefly on the time taken to execute the SPARQL queries. Finally, in Section 6.5, we enlist some possible recommendations, obtained from the queried data, which can be used to manage upcoming research ventures.

Fig. 21. Object properties linking the six classes of our Ontology.


Fig. 22. Some datatype properties of the six classes in our Ontology.

6.1. Provenance data for the CFP layer

In this section, we present some of the provenance data related to the CFP layer. Initially, we present the profile for HT09, whose data has been gleaned from the CFP layer (Section 5.3.1). This is shown in Fig. 23. It provides data according to the provenance graph shown in Fig. 17. This data is quite self-explanatory. We enlist some notable data as follows:

• We list the names of only three program committee members,
• Three types of paper awards are available, which we list along with the surnames of their respective winners,
• No travel grants were offered,
• Three tracks were organized,
• There were 7 accepted demos, and 25 accepted posters (we don't show the names of the winners here),
• The total review time was around 1.5 months,
• The acceptance rate is 20%, and
• The CFP was dispatched to 75 recipients.

If we query different research venture profiles (as shown in Fig. 23), the answers can assist us in assigning a value (rank) to a research venture, based on the mentioned parameters, e.g., those with more registered participants, or a lower acceptance rate etc. Moreover, the CFP layer can provide provenance data related to our selected publication. This data is shown in Fig. 24, and a comparison of such data across multiple research ventures can answer useful queries such as:

• The research ventures that have been registered by a particular author,
• The research ventures in which a given author (with his/her affiliation) has submitted a paper,
• The number of submitted papers in which the title contains one or more given keywords, e.g., "recommender system", "adaptivity", "e-commerce" etc., and
• The respective conferences at which such papers were submitted etc.

6.2. Provenance data for the Document layer

In this section, we present the provenance data related to the Document layer, which is shown in Fig. 25. It answers some of the queries mentioned in Section 5.2.2. For instance, the documented sections are the Paper Body, Experimental Methodology, Results and Discussion, Abstract, Introduction, Conclusions and Future Work, and References and Keywords. All these sections have been documented by the author Tariq Mahmood. Also, it took 5 and 4 days to document the Paper Body and the Related Work respectively, while the Results and Discussion took the maximum time to be documented (7 days). Similarly, it took 2 days and 1 day to revise the Paper Body and the Related Work respectively, while the number of revisions for both sections was 2. Moreover, the Results and Discussion was revised the maximum number of times (8), and the Experimental Methodology the minimum number of times (1). Fig. 25 also provides the keywords and the abstract (as they are documented in the paper), and shows that a total of 23 references are used by the authors.

Fig. 23. Research venture profile for Hypertext 2009.

Fig. 24. Some provenance data for CFP layer.


Fig. 27. Provenance data for the Content layer (first review of the selected paper).

Fig. 25. Provenance data for Document layer.

At this point, we also display the author profile for one of the authors, i.e., Tariq Mahmood, shown in Fig. 26. Besides the demographic data of the author (at the time of writing the paper), it conveys the following information:

• A list of the author's research interests,
• 75% of the author's papers have been accepted, and the other 25% have been rejected,
• The author has been writing research papers since 1st December, 2004,
• The author typically documents all the sections, i.e., the Abstract, Introduction, Experimental Methodology, Results and Discussion, References and Keywords, and Conclusions and Future Work,
• Within these sections, the author documents the Limitations of SOTA, Proposal of Idea, Methodology Logistics, Validation/Proof of Idea, Description of Idea, and Comparison with SOTA, and
• For the given paper, the author took 1.3 h to document and revise the Abstract, and 3.5 h to document and revise the Introduction.

We were not able to acquire any information regarding the activities of the reviewers for HT09. Hence, we cannot illustrate any reviewer profile for this conference. In fact, we have the reviews given by three different reviewers (which we describe in the next section), but we don't have the information regarding the provenance graph of the reviewer profile (Fig. 18).

6.3. Provenance data for the Content layer

Finally, we present the provenance data related to the Content layer. In fact, as mentioned in Section 5.2.3, this data can facilitate the reviewing process, and can help us in understanding the criteria for the reviewers' decisions. In this context, we illustrate the provenance data for the three reviews given for our selected paper. These reviews were given by three different reviewers, and we have extracted the information from these reviews and shown it in Figs. 27, 28, and 29 respectively.

Let us initially consider Fig. 27. Here, the first reviewer has determined that the Paper Body and the Experimental Methodology contain all their primary research contents. Also, the Introduction contains all its primary research contents, with the exception of Methodology Logistics, and the novel idea is validated in the Results and Discussion. The Experimental Methodology also contains the secondary research content Comparison with SOTA. In short, only the aforementioned information (regarding the research contents) was important for the first reviewer. Also, according to this reviewer, the idea presented in the paper was novel, it was described thoroughly, the selection of the experimental methodology was justified, the comparison with the related work was comprehensive, and the use of the English language was appropriate. All these criteria led the reviewer to accept the paper, notwithstanding the fact that a part of this paper had already been published, and that a few typos are present in the paper.

Now, let us consider Fig. 28, which illustrates the provenance data for the second review. Similar to the first reviewer, the second one believed that the idea is novel, it has been described thoroughly, the selection of the experimental methodology is justified, the related work is comprehensive, the paper has been clearly presented, and enough proof has been presented to justify the idea. Although this reviewer had more issues with the paper than the first one, he still accepted the paper.

Fig. 28. Provenance data for the Content layer (second review of the selected paper).

Fig. 26. Author profile for the author Tariq Mahmood.

Fig. 29. Provenance data for the Content layer (third review of the selected paper).


Finally, let us consider Fig. 29, which illustrates the provenance data for the third review. Similar to the first and second reviewers, the third one believed only that the idea has been described thoroughly, and that the use of the English language is appropriate. Besides this, according to this reviewer, the idea is not completely novel, the related work is quite limited, and the experimental methodology is flawed (the issues in this regard are also listed). On the whole, this reviewer rejected the paper.

6.4. Demo — use of SPARQL within Protégé

In this section, we will demonstrate the use of the SPARQL Query Tool, which we have employed in order to query our provenance data. In the first query, shown in Fig. 30, we retrieve the three instances of the class CFPPData, one for each of our selected conferences. The query tool is shown in the red box toward the left, and the results in the red box toward the right. The retrieved instances are CFP_HT09, CFP_ICICT11 and CFP_ICIET10, for HT09, ICICT11 and ICIET10 respectively. In the second query, shown in Fig. 31, we retrieve some data related to the profiles of our conferences. The queried parameters are the acceptance rate, the location, and the total number of reviews submitted by the reviewers. The results show that HT09 is the most competitive, with an acceptance rate of 20%. Also, the maximum number of reviews (403) was submitted at ICIET10, while the same information for HT09 is not available (N/A).

We added more variables to the second query, in order to acquire all the provenance data related to the profile of each research venture. We also executed similar queries for the author and reviewer profiles, as well as for each layer of our provenance hierarchy. For all these queries, we obtained the relevant results without any apparent delay in the execution of the queries. We don't show these results here, in order to avoid including a large number of snapshots. Also, we have currently documented only four research papers in PADS. Later on, we plan to document a large number of other papers and measure the query execution times for each of our queries. We intend to show these results in an extension of this journal paper.


6.5. Possible recommendations by PADS

As mentioned in Section 5, the profiles for the research ventures, the authors and the reviewers can be used to make useful recommendations for upcoming ventures. Referring to the provenance data listed in Sections 6.1–6.3, we will now mention some possible recommendations. Let's assume that some university is organizing the Hypertext 2013 conference. Then, for the domain of adaptive conversational recommender systems, we can recommend:

• A possible set of authors to whom the CFP should be dispatched (extracted from the author profiles),
• A possible set of reviewers (extracted from the reviewer profiles),
• A possible acceptance rate to be applied (extracted from the research venture profiles),
• The average number of participants that can be expected to attend the conference (extracted from both the research venture and author profiles),
• The expected average documentation time per submission (extracted from the author profiles),
• The number of papers to be assigned to each reviewer, and the allotted time for submitting the reviews (extracted from the reviewer profiles and applicable to any domain), and
• The publishing organization to be contacted for publishing the papers (extracted from the research venture profiles and applicable to any domain).

We conjecture that such information can be extremely beneficial for the organizers in efficiently and effectively managing novel research ventures.

Fig. 30. SPARQL Query to retrieve three CFPPData instances.
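The query of Fig. 30 can be reconstructed roughly as follows (CFPPData is the class from Section 5.4, while the pads: prefix is an illustrative assumption):

```sparql
# Sketch of the Fig. 30 query: retrieve all instances of CFPPData.
# The pads: prefix is an illustrative assumption.
PREFIX pads: <http://example.org/pads#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?cfp
WHERE {
  ?cfp rdf:type pads:CFPPData .
}
```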


Fig. 31. SPARQL Query to retrieve data related to the research venture profile.
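Similarly, the profile query of Fig. 31 can be sketched as follows; the OPTIONAL clauses keep a conference in the results even when a value, such as the total number of reviews for HT09, is unavailable (N/A). The property names follow Fig. 22, while the pads: prefix is an illustrative assumption:

```sparql
# Sketch of the Fig. 31 query: acceptance rate, location, and total
# number of reviews for each conference. Property names follow Fig. 22;
# the pads: prefix is an illustrative assumption.
PREFIX pads: <http://example.org/pads#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?cfp ?acceptRate ?location ?totReviews
WHERE {
  ?cfp rdf:type pads:CFPPData .
  OPTIONAL { ?cfp pads:AcceptRate ?acceptRate . }
  OPTIONAL { ?cfp pads:RVLocation ?location . }
  OPTIONAL { ?cfp pads:TotReviews ?totReviews . }
}
```

A filter such as FILTER(?acceptRate < 20) would likewise isolate the competitive conferences mentioned in Section 5.3.1.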

7. Conclusions and future work

The term "data provenance" commonly refers to the documented history of a digital artifact, e.g., the buying history of a painting. More recently, it has been used in the domain of scientific experimentation, in order to provide information that can assist in the re-creation of the experiments. In this paper, we have applied data provenance to the more generic domain of the scientific publication (research paper). Specifically, we have designed and implemented a provenance-aware system for documenting publications, labeled as PADS. It implements a provenance hierarchy which consists of three layers (from general to specific): 1) the CFP layer, which provides provenance data related to a given research venture, 2) the Document layer, which provides provenance data related to the structure of the documented publication (submitted to the given venture), and 3) the Content layer, which provides provenance data related to the contents of this publication. Moreover, this provenance data can assist in the generation of three types of provenance models, or profiles: 1) a profile for the research venture, 2) a profile for the author who has submitted a publication at this research venture, and 3) a profile for the reviewer who has reviewed one or more of the submitted papers.

In order to model this data, we employ the standard Open Provenance Model (OPM) specification (version 1.1), and we represent it as ontological instances (through the OWL language) within the Protégé editor. This allows semantic interoperability, i.e., the provenance data for different research ventures can be easily compared with each other. We implement a tracking mechanism in order to gather this provenance data, both manually and automatically. We allow users to query this data through the SPARQL query language. We performed some basic experiments with PADS, by documenting four publications related to three conferences. We showed the provenance data generated by our hierarchy, as well as the profile for a conference and an author. We also showed that SPARQL queries can be executed efficiently in PADS. Finally, we enlisted some recommendations, extracted from our provenance data, which can be useful for managing upcoming research ventures.

As our future work, we are currently working to enhance the natural language processing module, which we use to extract information about the contents of a documented paper and the contents of the reviews provided by the reviewers (Section 5.1).

Along with this, we plan to further develop and implement the set of recommendations currently output by PADS, e.g., as a recommender system. This system would be part of the PADS architecture, and would allow the conference organizers to pose diverse types of queries in order to manage upcoming ventures. We also plan to test the performance of PADS in a robust manner, by documenting a considerable number of publications, and measuring the time taken to execute the SPARQL queries that access our provenance data. Moreover, we plan to identify the exact link between scientific experimentation and scientific publication, i.e., how our hierarchy can be precisely used to support and guide the provenance of scientific experimentation. Finally, we are working to develop a provenance model for the domain of (scientific) book publication.

Acknowledgments

This work is sponsored in part by the Higher Education Commission of Pakistan, and by the Center for Research in Ubiquitous Computing (CRUC), NUCES, Karachi, Pakistan, with which the authors are associated.


Dr. Tariq Mahmood is currently an Assistant Professor in the Department of Computer Science at the National University of Computer and Emerging Sciences (NUCES), located in Karachi, Pakistan. His research interests include Data Mining, Web Mining, Artificial Intelligence, Provenance, Ontology, and the Semantic Web. He currently has 4 journal publications and numerous conference publications to his credit. He completed his PhD at the University of Trento, Italy, in 2009. His PhD dissertation concerned the application of Machine Learning techniques to develop interactive behavior strategies for product recommender systems. Moreover, he completed his MS degree at the well-reputed Université Pierre et Marie Curie (UPMC), in Paris, France, in 2004.


Dr. Syed Imran Jami received his B.S. in Computer Science from the University of Karachi in 2000, his M.S. in Computer Science from the Lahore University of Management Sciences in 2004, and his Ph.D. in Computer Science from the National University of Computer & Emerging Sciences in 2011. He is one of the founding members of the Center for Research in Ubiquitous Computing, and has been associated with it since 2006. He also worked with the Haptics Research Lab and the Pervasive and Networked Systems Research Group at Deakin University, Australia. Imran Jami has authored 6 journal papers and 8 conference papers based on his doctoral dissertation. He is also a member of several professional organizations, including the ACM, IAENG, SIAM etc.

Dr. Zubair A. Shaikh received his MS and Ph.D. degrees in Computer Science from Polytechnic University, New York, USA, in 1991 and 1994 respectively. He completed his BE (Computer Systems) at the Mehran University of Engineering and Technology, Jamshoro, Pakistan, in 1989. He is a Professor and the Associate Dean of the Faculty of Computer Science and Information Technology at the National University of Computer and Emerging Sciences, Karachi, Pakistan. He has published more than 70 papers in international conferences and journals. His research interests include Ubiquitous Computing, Artificial Intelligence, Social Networks, Human Computer Interaction, Wireless Sensor Networks and Data Provenance.

Mr. Muhammad Hussain Mughal received his MS in Computer Science from the National University of Computer and Emerging Sciences in 2010, and his B.E. (Software) from the Mehran University of Engineering and Technology, Pakistan, in 2007. His research interests include Databases, Data Mining, Machine Learning, Provenance, Ontology, and the Semantic Web. He has been associated with the Center for Research in Ubiquitous Computing since 2006. He has authored 2 research papers in journals and conferences of international repute.