Towards a Content Agnostic Computable Knowledge Repository for Data Quality Assessment

Accepted Manuscript

Naresh Sundar Rajan, Ramkiran Gouripeddi, Peter Mo, Randy K. Madsen, Julio C. Facelli

PII: S0169-2607(18)30625-4
DOI: https://doi.org/10.1016/j.cmpb.2019.05.017
Reference: COMM 4926

To appear in: Computer Methods and Programs in Biomedicine

Received date: 1 May 2018
Revised date: 16 April 2019
Accepted date: 17 May 2019

Please cite this article as: Naresh Sundar Rajan, Ramkiran Gouripeddi, Peter Mo, Randy K. Madsen, Julio C. Facelli, Towards a Content Agnostic Computable Knowledge Repository for Data Quality Assessment, Computer Methods and Programs in Biomedicine (2019), doi: https://doi.org/10.1016/j.cmpb.2019.05.017

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

- We identified research gaps in the data quality literature towards automating DQA methods.
- We designed, developed and implemented a computable data quality knowledge repository for assessing quality and characterizing data in health data repositories.
- We leveraged a service-oriented architecture towards a scalable, reproducible framework in disparate biomedical data sources.


Title: Towards a Content Agnostic Computable Knowledge Repository for Data Quality Assessment

Author names and affiliations: NARESH SUNDAR RAJAN1, RAMKIRAN GOURIPEDDI1, PETER MO1, RANDY K. MADSEN1, JULIO C. FACELLI1

1 Department of Biomedical Informatics, Center for Clinical and Translational Sciences (CCTS), Biomedical Informatics Core, University of Utah, 421 Wakara Way, Suite 140, Salt Lake City, Utah 84108-3514, USA.

Email addresses:
Naresh Sundar Rajan: [email protected]
Ramkiran Gouripeddi: [email protected]
Peter Mo: [email protected]
Randy K. Madsen: [email protected]
Julio C. Facelli: [email protected]

Corresponding Author: Julio C. Facelli, 801-581-4080

Present/permanent Address: 421 Wakara Way, Suite 140, Salt Lake City, Utah 84108-3514

Abstract

Background and Objective: In recent years, several data quality conceptual frameworks have been proposed across the Data Quality and Information Quality domains for assessing the quality of data. These frameworks are diverse, varying from simple lists of concepts to complex ontological and taxonomical representations of data quality concepts. The goal of this study is to design, develop and implement a platform agnostic computable data quality knowledge repository for data quality assessments.


Methods: We identified computable data quality concepts by performing a comprehensive literature review of articles indexed in three major bibliographic data sources. From this corpus, we extracted data quality concepts, their definitions, applicable measures and their computability, and identified conceptual relationships. We used these relationships to design and develop a data quality meta-model and implemented it in a quality knowledge repository.

Results: We identified three primitives for programmatically performing data quality assessments: a data quality concept, its definition, and its measure or rule for data quality assessment, together with their associations. We modeled a computable data quality metadata repository and extended this framework to adapt, store, retrieve and automate assessment of other existing data quality assessment models.

Conclusion: We identified research gaps in the data quality literature towards automating data quality assessment methods. In this process, we designed, developed and implemented a computable data quality knowledge repository for assessing quality and characterizing data in health data repositories. We leverage this knowledge repository in a service-oriented architecture to perform scalable and reproducible data quality assessments in disparate biomedical data sources.

Keywords: Data Quality Metadata Repository; Knowledge Representation; Data Quality Assessment; Data Quality Dimensions; Data Quality Framework


1. INTRODUCTION

Several data quality frameworks (DQF) and evaluation methodologies have been proposed in recent years in the domains of Data Quality (DQ) and Information Quality (IQ). Such methodologies have a wide range of applications, from sensor-based systems to clinical decision support systems and biomedical data stores for research [1][2]. Biomedical research, inclusive of health services, observational, comparative effectiveness, clinical trials, public health, genomic and exposomic studies, is increasingly dependent on the secondary use of existing data, including electronic health records (EHR), as sources for knowledge discovery [3][4]. Moreover, such studies often require large-scale and multi-site federation and integration of data from semantically and syntactically heterogeneous sources for selection of cohorts and performing study analysis [5][6].

Health data exhibit quality issues such as incompleteness, inaccuracy, and lack of appropriateness [7][8]. Assessing the quality of data is key for enabling appropriate use and study analysis as well as understanding and quantifying limitations of clinical and translational studies. Such efforts require expensive tools and human resources [9]. Using consensus-based and standardized data quality assessment (DQA) methods could ensure community acceptance [10], transparency and consistency of DQ results [11], leading to a better understanding of study limitations, quantification of uncertainties and reproducibility of associated results.

Proper use of different consensus-based and standardized DQA methods, and limiting their subjective misrepresentation, requires a computable store of data quality concepts (DQC) and their computation methods. Such a store would provide a pragmatic methodology for implementing DQA in heterogeneous and distributed environments, and would in addition support generation and reuse of quality metadata by different systems and users for diverse purposes [12].

Existing DQF are mainly conceptual representations of concepts and their methods of assessment [2][7][13]. The DQF put forth by Wang and Strong consists of 15 DQC classified into four high-level categories of DQ: Intrinsic, Contextual, Representational and Accessibility [14]. Kahn et al. proposed a fit-for-use conceptual model for single-site and multi-site biomedical research that consists of Intrinsic and Conceptual categories of DQC [2]. These categories are further classified into domains that have technical and clinical definitions and descriptions. This DQF also provides DQ rules that are unassigned to particular DQC or their domains. The DQF developed by Weiskopf and Weng consists of five DQC that have weighted associations with one or more of seven categories of measurement methods [7]. Almutiry et al. have developed a dimension-oriented taxonomy consisting of four DQC (Accuracy, Consistency, Completeness, and Timeliness) that have ten, thirteen, five and two measurement methods, respectively [15]. The DQ ontology described by Johnson et al. is a measures-oriented representation consisting of four high-level concepts (CorrectnessMeasure, ConsistencyMeasure, CompletenessMeasure, and CurrencyMeasure) which are further classified into 19 low-level measures [16]. More recently, Kahn et al. have harmonized some of these EHR DQF into three DQ categories: Conformance, Completeness, and Plausibility, and two DQA contexts: Verification and Validation [17]. Detailed descriptions of these DQF can be found in the supplementary material. These DQF meet DQA needs of specific domains, consist of DQC requiring human interpretation, and are not necessarily computable in a software framework.

The purpose of this study is to investigate a content and platform agnostic approach for DQA by discovery of primitives associated with computable DQC, and to design and implement a computable Quality Knowledge Repository (QKR) capable of storing any DQF and their methods of computation. Such a QKR can be leveraged in a framework as proposed in [18] to programmatically assess DQ of disparate and heterogeneous data.


2. METHODS

In the first part of the methods section, we describe the discovery of primitives associated with programmatic DQA. For this we identify key computable DQC by performing a comprehensive literature search and analyzing the retrieved literature for the DQC definitions, measures and associations to build a DQC meta-model. In the second part of this section we describe the design and development of a QKR that stores conceptual data quality models for use in a computable platform.


2.1. Data Quality Concept Information Extraction

We followed a seven-step process to discover DQ primitives and computable DQC and their measures (Figure 1).

Fig. 1. Discovery of DQ primitives and identification of computable DQC.


2.1.1. Literature Search

With the help of a Research Librarian, we identified the following keywords relevant to DQC and their methods of assessment: Data Quality Model, Data Warehouse Data Model, Secondary Data Quality Model, Data Quality Models in Data Warehouse, Database Quality Reporting Model and Data Quality Reporting Model. Using these keywords, we performed a systematic literature review [19] of three major bibliographic sources (Google Scholar, PubMed, and SCOPUS) for articles that were included in these resources before February 2016 (Supplementary Figure 1) and identified 1163 non-unique articles, from which we retained 940 unique articles in the English language. The first author screened these articles for their relevance to DQC based on their title and abstract, which resulted in 89 articles relevant to this project. A further full-text analysis of these articles reduced the total number of articles relevant to this work to 40. Excluded articles included those that did not discuss DQC or DQF in their titles or abstracts in the screening phase, or in the full text in the eligibility phase of the literature review.

2.1.2. DQC Identification

In this paper we use DQC when referring to any DQ term. The first author reviewed the full texts of the selected publications and manually extracted the DQC present in each article. The second author then validated this selection, and any discrepancies in the identification of DQC were resolved after discussion with all the authors.

2.1.3. DQC Definition Extraction

By an iterative manual review, we identified definitions for each DQC. We first identified and tabulated an International Standards Organization (ISO) based definition if available [20][21]. For those DQC that did not have an ISO based definition, we traversed the references cited in the articles to the first articles with original definitions. Using this process, we extracted definitions for all the DQC identified in step 2.1.2.


2.1.4. Measurement Method Extraction

A DQC may have a measurement method, which is the logic by which a concept can be computed. In our review of the literature we found that a DQC could have multiple methods of assessment. We adopted the most frequently used assessment methods for each DQC as its measurement methods, similar to work done in previous survey studies [23][7].

2.1.5. DQC Computability Classification

Using the extracted definitions and their measurement methods (when applicable), we classified the 70 concepts as requiring either subjective or objective methods of assessment [13]. Subjective DQC are stakeholders' perceptions and viewpoints about the quality of data and are therefore non-computable. Objective DQC are further classified as task-independent, which are generic and measurable states of data irrespective of the task at hand, and task-dependent, which are interpretations of task-independent concepts along with other auxiliary information [22]. We identified task-independent concepts and consolidated semantically similar concepts by retaining those that subsume similar concepts.

2.1.6. Finalization of DQC

We reviewed the DQC for the presence of measurement methods, the absence of which indicates that a DQC is not computable. The six DQC we extracted each had one or more methods of measurement.

2.1.7. Discovery of DQ Primitives

In the final step we identified primitives and their relationships associated with computable DQC in order to develop a conceptual representation.


2.2. Design and Development of QKR

2.2.1. Design: Conceptual Representation of DQF

We then used the standard metadata modeling approach of the Dublin Core Metadata Initiative [24] to represent these primitives and their relationships as a meta-model consisting of abstract DQ constructs.

2.2.2. Development: Logical Model of OpenFurther's Metadata Repository

In order to implement this meta-model, we adopted the OpenFurther Metadata Repository (MDR) [25][26] to develop the QKR. The MDR is an Object Modeling Group (OMG) standard-based repository of artifacts and knowledge about these artifacts. It stores metadata artifacts for data subscribed by data integration and knowledge presentation platforms [25][27]. These artifacts include but are not limited to: (1) logical data models and model associations, (2) administrative information, (3) descriptive information, and (4) translation programs. These artifacts are referred to as "Assets" in a custom-built, highly generic and abstracted entity-relationship model. Assets may have properties and associations to other assets. Associations can have properties of their own, and each association has a Type, which is also an asset. Stored metadata can be shared in various structured and non-proprietary formats (e.g. XML) using translation programs and made available for consumption via different software services. Adopting the MDR to develop the QKR allows ease of orchestration with our data federation and integration platform, OpenFurther [5][6], or consumption of its content by any other service-oriented architecture (SOA) based platform.
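The asset-and-association meta-model described above can be illustrated with a minimal sketch. This is our own simplified rendering in Python; the class names, identifiers and property names are hypothetical stand-ins, not the MDR's actual schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Asset:
    """A generic MDR artifact; concepts, definitions, measures,
    and even association types are all represented as assets."""
    asset_id: str
    namespace: str          # e.g. the DQF the asset belongs to
    label: str
    properties: Dict[str, str] = field(default_factory=dict)

@dataclass
class Association:
    """A typed link between two assets; the Type is itself an asset,
    and the association can carry its own properties."""
    source: Asset
    target: Asset
    assoc_type: Asset
    properties: Dict[str, str] = field(default_factory=dict)

# Illustrative content (identifiers are ours, for demonstration only):
has_measure = Asset("T1", "qkr-types", "hasMeasure")
completeness = Asset("C1", "cDQF", "Completeness")
completeness1 = Asset("M1", "cDQF", "Completeness1",
                      {"description": "not-null values / total values"})
links: List[Association] = [Association(completeness, completeness1, has_measure)]
```

Because association types are assets too, new kinds of relationships (e.g. inter-DQF mappings) can be added as data rather than as schema changes, which is the flexibility the MDR's design aims for.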


3. Results

3.1. Data Quality Concept Information Extraction Results

Our literature search resulted in 40 articles discussing DQ and DQF conceptually (Supplementary #2, 40 Articles Analysis). We identified 70 semantically non-unique DQC that were present in at least one of the 40 articles considered here (Supplementary #3, DQC Information Extraction Results.xlsx; Sheets 1, 2, and 3). We also extracted definitions for each of these DQC by tracing their provenance. By retaining task-independent concepts and consolidating semantically similar concepts (Supplementary #3, DQC Information Extraction Results.xlsx; Sheets 1, 2, and 3), we identified six unique computable DQC from the DQ literature (Table I) to form a computable DQF (cDQF). We associated each of these DQC with multiple valid measures and their descriptions based on their presentation in the literature. DQC are multidimensional objects consisting of three primitives: a concept name, its definition, and methods of computation that can be consumed by a query execution engine.


Table I. Computable DQC extracted from literature along with their Definitions and Measures (represented as Measure Codes and Descriptions) forming cDQF.

DQC: Accuracy
Definition: The extent to which data are correct, reliable, and certified free of error (b).
Measures:
- Accuracy1: Syntactic Accuracy = number of correct values / number of total values (a)
- Accuracy2: Number of accurate tuples (a)
- Accuracy3: Comparing with Gold Standard (b)
- Accuracy4: User Survey (a)

DQC: Completeness
Definition: The extent to which data are of sufficient breadth, depth, and scope for the task at hand (c).
Measures:
- Completeness1: Number of not-null values / total number of values (a)
- Completeness2: Number of tuples delivered / expected number (a)
- Completeness3: Comparing with Gold Standard (b)
- Completeness4: User Survey (a)
- Completeness5: Data Density: number of values per unit of time or per event

DQC: Concordance
Definition: The data is concordant when there is agreement or compatibility between data elements (c).
Measures:
- Concordance1: Agreement between data elements of the same sources (b)
- Concordance2: Agreement between data elements of other sources (b)

DQC: Consistency
Definition: The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use (c).
Measures:
- Consistency1: Number of consistent values / number of total values (a)
- Consistency2: Number of tuples violating constraints, number of coding differences (a)
- Consistency3: With respect to same format type across (b)
- Consistency4: User Survey (a)

DQC: Currency
Definition: The degree to which data has attributes that are of the right age in a specific context of use (c).
Measures:
- Currency1: Time at which data are stored in the system minus time at which data are updated in the real world (a)
- Currency2: Time of last update (a)
- Currency3: Request time minus last update (a)
- Currency4: User Survey (a)

DQC: Redundancy
Definition: Data contains no redundant values (a,d).
Measures:
- Redundancy1: Number of duplicates (a)

Note: Literature sources for DQC definitions and measures: a=[23], b=[7], c=[14], d=[169].
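Several of the measures in Table I reduce to simple ratios or counts over a column of values. As a rough illustration (our own code, not taken from the QKR implementation), Completeness1, Consistency1 and Redundancy1 might be computed as:

```python
from typing import Callable, List, Optional

def completeness1(values: List[Optional[str]]) -> float:
    """Completeness1: number of not-null values / total number of values."""
    return sum(v is not None for v in values) / len(values)

def consistency1(values: List[Optional[str]],
                 is_consistent: Callable[[str], bool]) -> float:
    """Consistency1: number of consistent values / number of total values.
    The consistency predicate (e.g. a format or range check) is caller-supplied."""
    return sum(v is not None and is_consistent(v) for v in values) / len(values)

def redundancy1(values: List[Optional[str]]) -> int:
    """Redundancy1: number of duplicate values."""
    non_null = [v for v in values if v is not None]
    return len(non_null) - len(set(non_null))

ages = ["34", "51", None, "51", "abc"]
print(completeness1(ages))              # 0.8  (4 of 5 values are not null)
print(consistency1(ages, str.isdigit))  # 0.6  (3 of 5 values are all-digit)
print(redundancy1(ages))                # 1    (one duplicated value, "51")
```

Note that measures such as Accuracy3 (gold-standard comparison) or Accuracy4 (user survey) need external reference data or human input, which is precisely why the cDQF restricts itself to measures a query engine can evaluate.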

3.2. Quality Knowledge Repository

Using the Dublin Core Metadata Initiative metadata modeling standard [24], we conceptually represented a DQC within a DQF as a Concept entity [28]. Each of these Concepts is associated with the primitives of a Definition and zero to many Measures of Computation (Figure 2). A DQF is represented as an abstract container, a namespace, within the QKR (Figure 2). The namespace serves as a unique identifier for the DQF, and all elements of a DQF are associated with the given namespace. Each DQC belonging to a particular DQF is represented as an asset (Asset:DQC) belonging to the corresponding namespace. The related entities of a DQC, its Definitions and Measures, are stored as associated assets: Asset:Definition and Asset:Measure, respectively. An Asset:DQC can be associated with multiple Asset:Measures.


Fig. 2. Conceptual (above) and physical (below) representations of DQC within the QKR. The conceptual model consists of each DQC represented as a Concept with related Definition and Measures. In the physical model these entities are stored as assets for Concepts, Definitions and Measures with appropriate Associations.


To demonstrate the generalizability of the QKR, we identified DQC, their definitions and measures in five other DQF [2][7][14][15][16] and mapped them to the QKR's conceptual model (Figure 2 and Figure 3). We then successfully loaded them as assets and associations into the QKR. We present the DQ models and the assets for cDQF and the five other DQF in the supplementary materials (Supplementary #4, cDQF and other DQF in QKR).

4. Discussion

We performed an exhaustive literature search for DQC covering a broad spectrum of domains ranging from sensors to electronic health record data. Our results included a significant number of false positives. Of the 940 unique articles in English, only 89 (9.47%) articles passed through the screening process for full-text review. This high level of attrition is due to existing ambiguity in the DQ literature [7] and the lack of inclusion of the DQ domain in standardized subject heading thesauri or classification systems such as the Medical Subject Headings or the ACM Computing Classification System [29]. Previous work has called for consistency and consensus in the nomenclature, definitions and use of various DQ constructs [7][11][23][30]. Inclusion of subject headings relevant to DQ into the thesauri or classification systems used for indexing literature would not only improve literature search results but also have implications for the practice of DQ [11][29]. Nevertheless, our results demonstrate that using a literature review it is possible to identify a comprehensive list of computable DQC that can be used in a software ecosystem for automated DQA [18].

In order to meet the design requirements for automated DQA, the DQC we identified needed to meet two criteria. First, the identified DQC should be agnostic to the content of a specific data domain, allowing a software platform leveraging them to function with any user-selected data. Second, the DQC needed to be computable using measures that can be programmatically called by software code and do not require human interpretation. Using methods described by Pipino et al. [22] and the DQCs' definitions and measures, we classified DQC as computable if they did not require subjective assessments and were independent of the task. Using this process, we developed cDQF: a DQF consisting of six semantically unique, computable, content and platform agnostic DQC, along with a definition for each and 19 computable measures. Our purposes for developing cDQF were to (1) discover DQ primitives that can be utilized to model DQF and DQC in a computable store, and (2) have a set of computable DQC for our next steps of developing and evaluating an SOA-based DQA architecture.

Our findings on comparing cDQF with the other DQF corroborate current understanding in the field of DQ (Table II). We found four groups of DQC: those with identical, synonymous, non-identical, and multiple definitions. The DQC Accuracy and Currency are present in all the DQF. Completeness and Consistency are not supported in the DQF proposed by Kahn et al. [2]. In addition to cDQF, Concordance is supported only in the DQF proposed by Weiskopf and Weng [7], although it also subsumes the DQC Consistency in this case. The DQC Redundancy is not supported in any of the other DQF. By construction, only DQC with computable measures are included in cDQF. Our results show that Accuracy, Completeness, Consistency, and Currency are the most represented DQC, which is consistent with the basic set of required DQC as defined by Batini et al. [23]. DQF have variable representations and each has been developed for specific domains and tasks. Our findings provide evidence for the need for a knowledge repository that can store different DQF, which can then be used for automated DQA.


Table II. Computable DQC in cDQF compared with other DQF

cDQF DQC     | Johnson et al. [16]                                      | Kahn et al. [2]                                          | Weiskopf and Weng [7]                | Wang and Strong [14]                                       | Almutiry et al. [15]
Accuracy     | Represented under multiple categories (4) (multiple definitions) | Represented under multiple categories (9) (multiple definitions) | Represented as Correctness (Synonymous) | Present (Identical)                                        | Present (Identical)
Completeness | Represented under multiple categories (6) (multiple definitions) | Absent                                                   | Present (Identical)                  | Present (Identical)                                        | Present (Identical)
Concordance  | Absent                                                   | Absent                                                   | Present (non-identical)              | Absent                                                     | Absent
Consistency  | Represented under multiple categories (3) (multiple definitions) | Absent                                                   | Represented as Concordance           | Represented partly as Representational Consistency (Subset) | Present (Identical)
Currency     | Present (Identical)                                      | Represented as Timeliness (Synonymous)                   | Present (Identical)                  | Represented as Timeliness (Synonymous)                     | Represented as Timeliness (Synonymous)
Redundancy   | Absent                                                   | Absent                                                   | Absent                               | Absent                                                     | Absent

Note: Identical - concepts that are exactly the same, with no difference in definition. Non-identical - concepts that have the same name but a similar or overlapping definition. Multiple definitions - concepts that are represented multiple times in the framework, with multiple definitions. Subset - concepts that are further classified into categories; for example, Consistency consists of Representation consistency, Domain consistency, Coding consistency, and Domain metadata. Synonymous - concepts grouped under the same category. Absent - concepts that are simply not present. Accuracy and Currency are represented in all the conceptual models, with identical, non-identical, synonymous and multiple definitions.

We discovered key primitives for DQA and developed a DQ conceptual model representing them and their relations using the Dublin Core Metadata Initiative metadata modeling standard [24]. We then developed and demonstrated the QKR as a computable store of these primitives and their relationships that can be made available to various software frameworks. In addition to using the QKR as a computable store for cDQF, we demonstrated its extensibility by adapting five other DQF. This generalizability of the QKR allows storage of any DQF, including those having subjective DQC. Computable measures of DQC are stored as a resource URL (Figure 3) in the QKR and can be leveraged in an SOA-based query engine for automated DQA.

Fig. 3. An example architecture leveraging the QKR to perform programmatic and automated DQA of large-scale, multi-site heterogeneous health data. Users select DQC and methods from DQF stored in the QKR. OpenFurther leverages the QKR's REST services to consume users' selections and DQ content to orchestrate DQA workflows for the selected data. A next version of the QKR will store appropriate visualizations for different DQC that can be leveraged by the Visualization Meta-Framework.


By design we have separated the storage of DQA content and rules from the semantic particulars of a dataset or data source, and from the DQA processes that require query engines and orchestration. This abstracted approach to DQA and management of DQA content gives DQ practitioners flexibility in how they implement their DQ software platforms. The Quality Services are a microservices-based architecture that establishes a service for each DQF using the content of the QKR, and interrogates OpenFurther's federated query engine [5][6] for semantic and data model transformations (Figure 3). OpenFurther stores the data semantics in a terminology server (TS), and the data model transformations in a metadata repository (MDR) (Figure 3). It orchestrates input queries and output results (in this case DQA queries and results) by performing model transformations and semantic translations using content in the MDR and TS, respectively. We describe an example of an SOA implementation using the QKR in Figure 4.


Fig. 4. An example SOA implementation leveraging the QKR. Here, (1) a user or system provides a DQA query, in this case 'Completeness of Age for Females with Diabetes Mellitus in an OMOP Repository using the cDQF framework and Completeness1 measure'. The Quality Services (2) consume the input query, and interrogate the QKR (3) for DQ content and the OpenFurther federated query engine (4) for orchestrating semantic translations and data model transformations. For example, administrativeGender for female, which is a SNOMED concept in the OpenFurther query language, is translated to OMOP's concept for female gender. Since the Completeness1 measure is the ratio of non-null values to the total number of values, two queries are generated: one for all females with diabetes mellitus, and a second for all females with diabetes mellitus with age being null. The Quality Services then compute the measure based on the rule stored in the QKR and generate a result (5).

While subjective DQC present in the five DQF are stored in the QKR, they cannot be leveraged programmatically as they require human input to generate DQ assessments; the QKR serves as a knowledge repository [27] for them. Thereby, the QKR provides (1) a shared understanding of DQC in different DQF, (2) templates for possible results for subjective DQC, (3) inter-DQC relationships where computable DQC are required to make subjective DQC assessments, and (4) a chain of provenance in SOA for users' choices of DQC and DQF and the obtained DQA results.
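The two-query pattern described in Figure 4 can be sketched against a toy relational store. This is illustrative only (sqlite3 here; the table and column names are hypothetical stand-ins for an OMOP schema, not the actual OpenFurther implementation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (gender TEXT, condition TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("F", "diabetes mellitus", 54),
     ("F", "diabetes mellitus", None),
     ("F", "diabetes mellitus", 61),
     ("M", "diabetes mellitus", 47)],
)

# Query 1: all females with diabetes mellitus (the denominator).
total = conn.execute(
    "SELECT COUNT(*) FROM person "
    "WHERE gender='F' AND condition='diabetes mellitus'"
).fetchone()[0]

# Query 2: the same cohort with age missing (used to derive the numerator).
nulls = conn.execute(
    "SELECT COUNT(*) FROM person "
    "WHERE gender='F' AND condition='diabetes mellitus' AND age IS NULL"
).fetchone()[0]

# Completeness1 = non-null values / total values.
print((total - nulls) / total)  # 2/3 for this toy cohort
```

In the actual architecture these two queries would be generated by the federated query engine after semantic translation, and the division rule itself would come from the measure stored in the QKR rather than being hard-coded.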

The contents of the QKR are openly available through RESTful services at https://qmdr.ccts.utah.edu/qmdr/{URI}. Users and systems can leverage these services to interrogate the contents of the QKR, export them in CSV and JSON formats, utilize them in software ecosystems and build graphical user interface tools on top of these services. All code and technical descriptions of the QKR are available on GitHub (https://github.com/openfurther/further-open-qkr). As a part of our future work, we plan to evaluate the data model using a system usability scale [30] and assisted survey interviews with DQ experts for evaluating this DQA modeling and the QKR. Detailed documentation on how to map a DQF to the QKR is provided in the supplementary material (Supplementary #5; Mapping a DQF to QKR - Examples.docx).
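A client of these services might look like the following sketch. The base URL is the one given above, but the concrete resource path and the response format are not specified here, so the path used below is a hypothetical example:

```python
import json
import urllib.request

BASE = "https://qmdr.ccts.utah.edu/qmdr"

def qkr_url(uri: str) -> str:
    """Build a QKR service URL following the {URI} pattern given above.
    Any concrete path passed in (e.g. "asset/cDQF") is a hypothetical
    example, not a documented endpoint."""
    return f"{BASE}/{uri.lstrip('/')}"

def fetch_json(uri: str):
    """Fetch a QKR resource, assuming (hypothetically) a JSON response."""
    with urllib.request.urlopen(qkr_url(uri)) as resp:
        return json.load(resp)

print(qkr_url("asset/cDQF"))  # https://qmdr.ccts.utah.edu/qmdr/asset/cDQF
```

A downstream system would call such services at DQA time, so that DQF content is always read from the repository rather than duplicated into each consumer.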

While there are several other IQ-based frameworks identified in the literature [238], we chose five specific frameworks that pertain to DQA and are relevant to health data. The QKR as such is generic enough to accommodate DQF in the IQ domain. Similarly, the QKR can accommodate secondary DQC (those that require prior computation of primary DQC) and analytical concepts such as those required for clinical phenotyping (e.g. sensitivity, specificity, precision and recall) [31]. As a future step, we plan to adopt DQF relevant to Big Data [32] as next versions of the OpenFurther platform integrate biomedical data into Big Data stores [33]. Data cleansing methods [34] can also be represented in computable formats in the QKR, similar to the DQF.


Recent efforts have harmonized different DQF relevant for EHR data [17], and have started to yield inter-DQF DQC mappings. This harmonized DQF and the inter-DQF mappings can be stored in the QKR as its own namespace and as inter-model associations, respectively, and then be utilized by the OpenFurther platform for translating DQA across different DQF. While the current scope of this harmonized DQF is limited to EHR data [17], and future versions of this DQF could include other important translational research data such as genomic and exposomic data, our agnostic approach would support these different versions based on the needs of an implementation. Similarly, the QKR along with OpenFurther can be used to translate the DQ content to other metadata specification standards such as ISO, RDF and XML [35][36][37]. The current version of the QKR uses a relational store; we are working on graph approaches using a distributed multi-model graph database (OrientDB) to support easier authoring and viewing of the content. While these are inherent advantages of leveraging the OpenFurther MDR due to its simplistic meta-model consisting of assets and their associations, the conceptual model of DQ primitives can be implemented in other metadata stores.

Using a computable store of DQF such as the QKR could lead to the proper use of different consensus-based and standardized DQA methods. As a knowledge repository, it would allow comparisons and translations between different DQF. When leveraged in an SOA-based data ecosystem, it would provide a trajectory of the resulting DQA, the provenance of which could be associated with comprehensive information stored in the QKR. Such metadata could make DQA results more believable to end users, saving the costs and resources needed for repeat DQA. These DQA automation steps provide a foundation on which the transparency and consistency of DQ results can be ensured [11], and ultimately the reproducibility of translational research [41]. Implementers of the QKR have the flexibility of loading content from one or more DQF as well as any inter-DQF mappings.

A major limitation of the cDQF is that, by design, it includes only six DQC. While this might be an oversimplification of DQA, our goal in this effort was to discover DQ primitives and to identify DQC that are computable and can be utilized by software platforms. Similarly, the assessment for DQC of other

DQF stored in the QKR can be programmatically generated only if they are computable. Nevertheless, as discussed earlier, subjective concepts are stored in the QKR as references for use by DQA practitioners and downstream software systems. Next, the modeling and implementation of the other DQF need further review by their owners, which we will accomplish through the evaluation methods described above. Finally, the value of this approach for DQA will be demonstrated only through an actual evaluation using a real-world problem. For that, we leverage the QKR with an SOA-based data integration and federation orchestration platform, OpenFurther [5][6], to programmatically and automatically perform DQA of health data [18] (Figure 3). In this implementation, users select DQC and their methods of computation from their choice of DQF stored in the QKR. Through a set of DQ services, OF orchestrates DQA query workflows by consuming users' selections, DQ content from the QKR, and its existing data transformation/translation capabilities to perform DQA across large, heterogeneous, and disparate stores of data. Results of this implementation will be presented in future work.

5. Conclusion

Current approaches to DQA are resource intensive. In order to automate DQA methods, we discovered key DQ primitives and their relationships and developed a computable store, the QKR. In this process, we also identified computable DQC for assessing the quality of data. We demonstrate the generalizability of the QKR by storing different DQF, including their DQC and their associated definitions and measures. The QKR serves as a computable knowledge repository, and its content can be consumed by SOA platforms [18] for automating DQA in heterogeneous and disparate data environments, which will be its true evaluation. This approach to representing quality concepts in a computable format will be a first step toward performing scalable DQA and ensuring the reproducibility of translational research.


Conflict of Interest

The authors declare that they do not have any conflicts of interest. JCF and RG are partners in T-REx Informatics, but this paper does not contain any reference to products or services from T-REx Informatics.


Acknowledgements

Funding: The research reported in this publication was supported in part by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR001067. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Naresh Sundar Rajan was partially supported by the Richard A. Fay and Carole M. Fay Endowed Graduate Fellowship for the Department of Biomedical Informatics in honor of Homer R. Warner, MD, PhD. Computational infrastructure and resources were provided by the Utah Center for High Performance Computing.

Research data for this article: The data used in this article consist of the citations retrieved in the bibliographic search. The list of all references considered here is given in the supplementary material.


References

[1] I.-G. Todoran, L. Lecornu, A. Khenchaf, J.-M. Le Caillec, A Methodology to Evaluate Important Dimensions of Information Quality in Systems, J. Data Inf. Qual. 6 (2015) 11.

[2] M.G. Kahn, M.A. Raebel, J.M. Glanz, K. Riedlinger, J.F. Steiner, A Pragmatic Framework for Single-site and Multisite Data Quality Assessment in Electronic Health Record-based Clinical Research, Med. Care 50 (2012). doi:10.1097/MLR.0b013e318257dd67.

[3] W.R. Hersh, M.G. Weiner, P.J. Embi, J.R. Logan, P.R.O. Payne, E.V. Bernstam, H.P. Lehmann, G. Hripcsak, T.H. Hartzog, J.J. Cimino, Caveats for the use of operational electronic health record data in comparative effectiveness research, Med. Care 51 (2013) S30.

[4] H.R. Warner, J.D. Morgan, High-density medical data management by computer, Comput. Biomed. Res. 3 (1970) 464–476. doi:10.1016/0010-4809(70)90008-X.

[5] R. Gouripeddi, P.B. Warner, P. Mo, J.E. Levin, R. Srivastava, S.S. Shah, D. de Regt, E. Kirkendall, J. Bickel, E.K. Korgenski, Federating clinical data from six pediatric hospitals: process and initial results for microbiology from the PHIS+ Consortium, in: AMIA Annu. Symp. Proc., American Medical Informatics Association, 2012: p. 281.

[6] R. Gouripeddi, D.N. Schultz, R.L. Bradshaw, P. Mo, R. Butcher, R.K. Madsen, P.B. Warner, B. LaSalle, J.C. Facelli, FURTHeR: An Infrastructure for Clinical, Translational and Comparative Effectiveness Research, (2013). http://knowledge.amia.org/amia-55142-a2013e-1.580047/t-10-1.581994/f-010-1.581995/a-184-1.582011/ap-247-1.582014.

[7] N.G. Weiskopf, C. Weng, Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research, J. Am. Med. Inform. Assoc. 20 (2013) 144–151. doi:10.1136/amiajnl-2011-000681.

[8] A.L. Nobles, K. Vilankar, H. Wu, L.E. Barnes, Evaluation of data quality of multisite electronic health record data for secondary analysis, in: 2015 IEEE Int. Conf. Big Data (Big Data), IEEE, 2015: pp. 2612–2620.

[9] HIMSS, 2013 Annual Report of the U.S. Hospital IT Market, (2013). http://apps.himss.org/foundation/docs/2013HIMSSAnnualReportDorenfest.pdf.

[10] T.J. Callahan, J.G. Barnard, L.J. Helmkamp, J.A. Maertens, M.G. Kahn, Reporting Data Quality Assessment Results: Identifying Individual and Organizational Barriers and Solutions, eGEMs (Generating Evid. Methods to Improv. Patient Outcomes) 5 (2017) 16. doi:10.5334/egems.214.

[11] M.G. Kahn, J.S. Brown, A.T. Chun, B.N. Davidson, D. Meeker, P.B. Ryan, L.M. Schilling, N.G. Weiskopf, A.E. Williams, M.N. Zozus, Transparent reporting of data quality in distributed data networks, eGEMs (Washington, DC) 3 (2015) 1052. doi:10.13063/2327-9214.1052.

[12] A.P. Chapman, A. Rosenthal, L. Seligman, The Challenge of "Quick and Dirty" Information Quality, J. Data Inf. Qual. 7 (2016) 1–4. doi:10.1145/2834123.

[13] C. Batini, M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications), Springer-Verlag New York, Inc., 2006.

[14] R.Y. Wang, D.M. Strong, Beyond accuracy: what data quality means to data consumers, J. Manag. Inf. Syst. 12 (1996) 5–33. http://dl.acm.org/citation.cfm?id=1189570.1189572 (accessed January 17, 2014).

[15] O. Almutiry, G. Wills, R. Crowder, A dimension-oriented taxonomy of data quality problems in electronic health records, IADIS Int. J. WWW/Internet 13 (n.d.) 98–114. http://eprints.soton.ac.uk/384258/.

[16] S.G. Johnson, S. Speedie, G. Simon, V. Kumar, B.L. Westra, A Data Quality Ontology for the Secondary Use of EHR Data, AMIA Annu. Symp. Proc. 2015 (2015) 1937–1946. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765682/.

[17] M.G. Kahn, T.J. Callahan, J. Barnard, A.E. Bauck, J. Brown, B.N. Davidson, H. Estiri, C. Goerg, E. Holve, S.G. Johnson, A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data, eGEMs 4 (2016).

[18] N. Sundar Rajan, R. Gouripeddi, J.C. Facelli, A Service Oriented Framework to Assess the Quality of Electronic Health Data for Clinical Research, in: 2013 IEEE Int. Conf. Healthc. Informatics (ICHI), IEEE, 2013: p. 482.

[19] A. Liberati, D.G. Altman, J. Tetzlaff, C. Mulrow, P.C. Gøtzsche, J.P.A. Ioannidis, M. Clarke, P.J. Devereaux, J. Kleijnen, D. Moher, The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration, PLoS Med. 6 (2009) e1000100.

[20] ISO/IEC, Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE), (2011). http://www.iso.org/iso/catalogue_detail.htm?csnumber=35733.

[21] ISO/IEC, Software engineering - Software product Quality Requirements and Evaluation (SQuaRE) - data quality model, Int. Organ. Standardization, Geneva, (2008). http://www.iso.org/iso/catalogue_detail.htm?csnumber=35736.

[22] L.L. Pipino, Y.W. Lee, R.Y. Wang, Data quality assessment, Commun. ACM 45 (2002) 211–218.

[23] C. Batini, C. Cappiello, C. Francalanci, A. Maurino, Methodologies for data quality assessment and improvement, ACM Comput. Surv. 41 (2009) 1–52. doi:10.1145/1541880.1541883.

[24] ISO, ISO 15836:2009 - Information and documentation - The Dublin Core metadata element set, (2009). https://www.iso.org/standard/52142.html (accessed April 29, 2018).

[25] R.L. Bradshaw, S. Matney, O.E. Livne, B.E. Bray, J.A. Mitchell, S.P. Narus, Architecture of a federated query engine for heterogeneous resources, AMIA Annu. Symp. Proc. 2009 (2009) 70–74. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2815441&tool=pmcentrez&rendertype=abstract.

[26] P. Mo, R.L. Bradshaw, R. Butcher, R. Gouripeddi, P.B. Warner, R.K. Madsen, B. LaSalle, J.C. Facelli, N.D. Schultz, Real-time Federated Data Translations using Metadata-driven XQuery, AMIA CRI Spring 2014 (2014). http://knowledge.amia.org/amia-56636-cri2014-1.977698/t-004-1.978136/a-089-1.978209/a-089-1.978210/ap-085-1.978211.

[27] R.L. Bradshaw, C.J. Staes, G.D. Fiol, S.P. Narus, J.A. Mitchell, Going FURTHeR with the Metadata Repository, in: AMIA Annu. Symp. Proc., 2012.

[28] P. Chen, Entity-relationship modeling: historical events, future trends, and lessons learned, in: Softw. Pioneers, Springer, 2002: pp. 296–310.

[29] W.R. Hogan, M.M. Wagner, Accuracy of data in computer-based patient records, J. Am. Med. Inform. Assoc. 4 (n.d.) 342–355. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=61252&tool=pmcentrez&rendertype=abstract.

[30] J. Brooke, SUS: A "Quick and Dirty" Usability Scale, in: P.W. Jordan, B. Thomas, B.A. Weerdmeester, A.L. McClelland (Eds.), Usability Evaluation in Industry, 1996.

[31] N. Sundar Rajan, R. Gouripeddi, J.C. Facelli, Measuring Validity of Phenotyping Algorithms across Disparate Data using a Data Quality Assessment Framework, in: 3rd Work. Data Min. Med. Informatics Learn. Health, AMIA Annu. Symp., American Medical Informatics Association, 2016. http://www.dmmh.org/dmmi16.

[32] J. Merino, I. Caballero, B. Rivas, M. Serrano, M. Piattini, A Data Quality in Use model for Big Data, Futur. Gener. Comput. Syst. (2015).

[33] R. Gouripeddi, An Informatics Architecture for an Exposome, in: Second. Use Data Res. (Interactive Learn.), Annu. Symp. Proc. Am. Med. Informatics Assoc. 2016 Jt. Summits Transl. Sci., 2016.

[34] O. Dziadkowiec, T. Callahan, M. Ozkaynak, B. Reeder, J. Welton, Using a Data Quality Framework to Clean Data Extracted from the Electronic Health Record: A Case Study, eGEMs 4 (2016).

[35] ISO/IEC 11179-4:2004, Information Technology - Metadata Registries (MDR) - Part 4: Formulation of data definitions, (2004).

[36] W3C, RDF 1.1 Concepts and Abstract Syntax, (2014). https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/.

[37] OMG, XML Metadata Interchange Specification Version 2.5.1, (n.d.). https://www.omg.org/spec/XMI/About-XMI/ (accessed April 29, 2018).