Executable Data Quality Models


Procedia Computer Science 104 (2017) 138–145

ICTE 2016, December 2016, Riga, Latvia

Janis Bicevskis*, Zane Bicevska, Girts Karnitis

University of Latvia, Riga, Latvia; DIVI Grupa Ltd, Riga, Latvia

Abstract

The paper discusses an external solution for data quality management in information systems. In contrast to traditional data quality assurance methods, the proposed approach uses a domain specific language (DSL) for describing data quality models. Data quality models consist of graphical diagrams whose elements contain requirements for data object values and procedures for data object analysis. A DSL interpreter makes the data quality model executable, thereby enabling the measurement and improvement of data quality. The described approach can be applied: (1) to check the completeness, accuracy and consistency of accumulated data; (2) to support data migration in cases when the software architecture and/or data models are changed; (3) to gather data from different data sources and to transfer them to a data warehouse.

© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the international conference ICTE 2016.

Keywords: Data quality; Domain-specific languages; Business process execution

1. Introduction

Data quality can be defined as the degree to which a set of characteristics of data fulfills requirements1. Examples of such characteristics are completeness, validity, accuracy, consistency, availability, and timeliness. Requirements are defined as needs or expectations that are stated, generally implied or obligatory. The ISO 9001:2015 standard considers data quality as a relative concept, largely dependent on the specific requirements resulting from the data usage. This means the same data can be of good quality for one usage and completely unusable for another. For instance, to determine the number of students in a high school, only the status of students is of interest, not other data such as the students' age or gender.

* Corresponding author. E-mail address: [email protected]

doi:10.1016/j.procs.2017.01.087


To evaluate data quality for a specific usage, the requirements for data values must be described. The descriptions should be executable, as the stored data will be "scanned" and its compliance with the requirements will be checked. It should be emphasized that many conditions and requirements cannot be checked during data input, because they depend on values of other data objects that have not yet been entered into the database. For instance, at the time of a student's enrollment not all information about his/her financial obligations is available and/or entered in the database, as the data may be received later. This is the reason why high-quality data occurs rarely in practice. Therefore, a topical issue is to analyze and evaluate the suitability of data for specific usages or tasks.

The described problem has been topical for over 50 years. Traditionally, developers include data quality checking functionality in information systems as two separate types of control:

• Syntactic control – values of data objects are checked locally within one record (compliance of input data with the syntax); it also includes compatibility control on linked data of the record
• Semantic/contextual control – checking whether the newly entered data is compatible with data previously entered and stored in the database. The semantic control should be repeated every time new values of data object attributes are entered or existing ones are edited

The proposed approach involves creating a specific data quality model for each information system. The model is described using means of a domain specific language, and it allows requirements for data object attribute values and their compatibility to be defined clearly. The data quality model is executable: both the syntactic and the semantic controls are performed. The approach makes it possible to use the data quality model for measurement and evaluation of data quality as well as for checking data to be imported into a data warehouse.

The paper deals with the following issues: an identification of the problem to be solved and a brief description of the basic ideas (Section 2), an overview of related research (Section 3), a description of the proposed solution (Section 4), and a discussion of the possibilities to use data quality models in practice (Section 5).

2. Statement of the problem and ideas for solutions

The data quality management solution proposed by the authors of this paper is based on the usage of executable domain specific languages (DSL). Data quality management is complicated because data in the database is accumulated gradually, according to the flow of received documents or recorded events, and this flow does not always correspond to the sequence of events in real life. Data about events entered into the database can also be incomplete and arrive in a logically wrong sequence. This research seeks a mechanism supporting data quality evaluation even when only part of the data objects has been fully entered into the database.

Traditionally, data quality in databases is provided and controlled by setting constraints on the attribute values and relations that may be stored in the database, for example, by using the Object Constraint Language (OCL)2. The OCL allows detailed constraints to be specified on the data/attribute values to be stored in the database, and it protects the database from incorrect content.
Nevertheless, the OCL is not able to solve the problem formulated above – a gradual accumulation of data that allows temporary incompatibilities while guaranteeing correct data at definite checkpoints (when a complete data set about a data object has been entered in the database). The proposed solution supports gradual filling in of the data, with data quality controls at specific points (steps in the data quality model). Data quality models are described using a DSL that is suitable for describing data object attribute values. A data quality model consists of graphical diagrams which are created in the specific DSL and resemble traditional flow charts (see the following chapters for details). Furthermore, not one universal, "right" DSL is proposed but a whole language family. The editor building platform DIMOD3 supports defining the DSL syntax and creating an editor suitable for editing diagrams in the DSL. Data quality controls are the elements of the DSL. The models described in the DSL become executable as soon as the controls are implemented as software routines or SQL statements.
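To illustrate the idea of checkpoint-based controls, the following is a minimal sketch of such a control expressed as an SQL statement, using the student enrollment example from the introduction. The tables Students and Financial_obligations and their columns are hypothetical and are introduced here only for illustration; the point is that a condition which cannot be enforced at input time (the obligation data may arrive later) can still be checked at a defined checkpoint of the data quality model.

```sql
-- Checkpoint control (hypothetical schema): list students whose enrollment is
-- marked as complete but who still have no financial obligation record.
-- The check is executed at a defined point in the data quality model,
-- not at the moment the student record is inserted.
SELECT s.Student_ID, s.Student_name
FROM   Students s
WHERE  s.Enrollment_status = 'completed'
  AND  NOT EXISTS (SELECT 1
                   FROM   Financial_obligations f
                   WHERE  f.Student_ID = s.Student_ID);
```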




3. Related research and solutions

At first glance, the OCL approach proposed by the OMG group seems to be the most appropriate for the description of data quality requirements. Unfortunately, as already mentioned in the introduction of this paper, the OCL technology is hard to use for data quality management. The OCL is an extension of UML class diagrams, and it allows describing the data sets that comply with data quality requirements. Such descriptions are static, which makes them unusable in cases when data is created and inserted into the database gradually. As the gradual entering of data may cause data quality defects, data quality checking should be performed every time data is inserted or modified, and this contrasts with the OCL approach. This chapter analyzes related research of other authors that is ideologically close to the proposed approach of executable models.

3.1. Executable UML

Numerous studies have claimed4 that models should be dynamically executable. According to this research, the descriptions of business processes should be detailed so clearly and consistently that the processes can be executed automatically without human participation. Even if the processes are executed by people, the business process descriptions should be unequivocal and definite. A comprehensive overview of the executable UML research roadmap is given by E. Seidewitz5.

One research field of executable UML deserves particular attention – Foundational UML (fUML)6. It is an executable subset of standard UML that can be used to define, in an operational style, the structural and behavioral semantics of systems. This subset includes the typical structural modeling constructs of UML, such as classes, associations, data types and enumerations. But it also includes the ability to model behavior using UML activities, which are composed from a rich set of primitive actions. A model constructed in fUML is therefore executable in exactly the same sense as a program in a traditional programming language, but it is written with the level of abstraction and richness of expression of a modeling language. The idea of this approach is similar to the data quality management approach proposed in this paper. The essential difference is the implementation of data quality checking by using specific software routines or SQL statements.

3.2. Monterey Phoenix

The authors of Monterey Phoenix7 offer an original approach to creating executable models. The model consists of a set of events, where an event is a process step – an activity or a task. Two kinds of relations between events are possible – PRECEDES and IN. The relation PRECEDES describes a partially defined event execution sequence. The relation IN describes the inclusion of one or several events in a more general event; essentially this relation represents progressive detailing of events, where one event can be broken down into "more detailed" events. The semantics of an individual event is not defined more precisely, as the event can be a routine/source code in programming. The modelling is performed according to the left-to-right and top-down principles.

In the case of data quality, the result of a data requirements validation serves as an event. If no data defects are discovered during the validation, the next check is performed; if a defect is discovered, users are informed via messages. In case the SQL notation is used, an object of another type is also created: a table containing the incorrect data.
The proposed model structure lets you build virtually any algorithm description from event sets, including loops, branches and other constructions traditionally used in programming. Accordingly, Monterey Phoenix models can describe the architecture of information systems precisely without detailing implementation issues for particular activities. The proposed modelling approach guarantees that the models are executable, and executable models can be checked for correctness, as it is possible to detect all meaningful use cases; by checking this use case set, it is possible to identify system errors if they exist. The most positive feature of the Monterey Phoenix approach is its simplicity in conjunction with the richness of available means: a high-level specification can be created using only a few simple notions. However, the question of the limits of the Monterey Phoenix approach remains open: to what extent can the functionality of an information system be described using the approach?


4. Data quality models

4.1. Prototype – working time tracking system (WTT)

The proposed ideas will be demonstrated with the help of a simple example. Let us consider a working time tracking system having the ER model given in Fig. 1. There are many active projects in an enterprise (entity Projects); every project has several employees (entity Developers); each employee (developer) may be involved in several projects; the working time spent by an employee (developer) in a specific time frame is assigned to one specific project (entity Work_time).

The entity Projects has the following attributes:

• Proj_ID – project identifier/code that will be used by developers to specify the project to which the spent work time should be referred
• Proj_name – project name
• Proj_start_date – project start date
• Proj_end_date – project end date
• Proj_limits – the maximum allowable work amount of the project in man-hours
• Proj_actual – project status (active/passive)
• Proj_leader_ID – project manager

Fig. 1. (a) ER model of WTT; (b) Syntactic control of input record.

The entity Developers has the following attributes:

• Dev_ID – identifier/code of the employee, used to specify the developer to whom the spent work time should be referred
• Dev_name – developer's name
• Dev_surname – developer's surname
• Dev_load – the minimum monthly workload of the developer to be devoted to all projects in which the developer is involved

The entity Work_time has the following attributes:

• Wt_ID – identifier of the spent working time record
• Wt_hours – spent working time of the developer
• Wt_date – date of the spent working time
• Wt_work_descr – description of the performed work
• Wt_accept – whether the reported working time is accepted by the project manager (Yes/No)
• Proj_ID – project identifier specifying the project to which the spent work time should be referred
• Dev_ID – developer's identifier specifying the developer to whom the spent work time should be referred

The entity Proj_Dev_time is a junction table for dealing with the many-to-many relationship, and it contains the following attributes:

• Proj_ID – identifier of the project where the developer Dev_ID works
• Dev_ID – identifier of the developer working in the project Proj_ID
• Start_date – date when the developer Dev_ID started to work in the project Proj_ID
• End_date – date by which the developer Dev_ID will be assigned to the project Proj_ID
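For concreteness, the following is a minimal sketch of how the WTT entities described above might be declared as relational tables. The paper itself only lists the attributes; the column data types and key constraints below are assumptions made for illustration.

```sql
-- Hypothetical DDL sketch of the WTT schema; types and constraints are assumed.
CREATE TABLE Projects (
  Proj_ID         INTEGER PRIMARY KEY,
  Proj_name       VARCHAR(100),
  Proj_start_date DATE,
  Proj_end_date   DATE,
  Proj_limits     INTEGER,          -- maximum allowable work amount in man-hours
  Proj_actual     CHAR(1),          -- project status (active/passive)
  Proj_leader_ID  INTEGER
);

CREATE TABLE Developers (
  Dev_ID      INTEGER PRIMARY KEY,
  Dev_name    VARCHAR(50),
  Dev_surname VARCHAR(50),
  Dev_load    INTEGER               -- minimum monthly workload in hours
);

CREATE TABLE Proj_Dev_time (
  Proj_ID    INTEGER REFERENCES Projects(Proj_ID),
  Dev_ID     INTEGER REFERENCES Developers(Dev_ID),
  Start_date DATE,
  End_date   DATE,
  PRIMARY KEY (Proj_ID, Dev_ID)
);

CREATE TABLE Work_time (
  Wt_ID         INTEGER PRIMARY KEY,
  Wt_hours      DECIMAL(4,1),
  Wt_date       DATE,
  Wt_work_descr VARCHAR(255),
  Wt_accept     CHAR(1),            -- accepted by the project manager (Y/N)
  Proj_ID       INTEGER REFERENCES Projects(Proj_ID),
  Dev_ID        INTEGER REFERENCES Developers(Dev_ID)
);
```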

Let us assume that the developers prepare reports about their working time autonomously and send the reports to a data warehouse where all enterprise data from various sources is collected.

4.2. Data quality controls

The ETL procedure of the data warehouse receives messages containing five attribute values:

< Proj_ID, Dev_ID, Wt_date, Wt_hours, Wt_work_descr >
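As an illustration of how such a message could be received and then checked by the controls introduced below, the following sketch assumes a hypothetical staging table Wt_inbox into which incoming messages are loaded before the data quality model is executed; both the table and the example check are assumptions, not part of the original paper.

```sql
-- Hypothetical staging table for incoming working time messages.
CREATE TABLE Wt_inbox (
  Proj_ID       INTEGER,
  Dev_ID        INTEGER,
  Wt_date       DATE,
  Wt_hours      DECIMAL(4,1),
  Wt_work_descr VARCHAR(255)
);

-- A simple syntactic control: select messages with missing mandatory fields
-- or an implausible number of reported hours (e.g. more than 8 per day).
SELECT *
FROM   Wt_inbox
WHERE  Proj_ID IS NULL
   OR  Dev_ID IS NULL
   OR  Wt_date IS NULL
   OR  Wt_hours IS NULL
   OR  Wt_hours <= 0
   OR  Wt_hours > 8;
```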

Fig. 2. (a) Contextual control on interrelated data; (b) Implementation of contextual control on interrelated data.

Data quality checking may be performed at three levels:

• Syntactic control (some of the possible controls are shown in Fig. 1b) ensures quality control within one input message: (1) are all mandatory fields completed? (2) do the input values have a correct data type? (3) does the input value fulfill the conditions of the field (for example, the value of a summary field is the sum of other field values, an employee may report a maximum of 8 hours per day), and others
• Contextual control on interrelated data (see Fig. 2) ensures quality control using attribute values of mutually interconnected data objects: (1) does the message contain object instances with references to other data objects? (2) are the attribute values of the input data in compliance with the related data objects (for example, the reported working time should fall within the project's time frame), and others
• Contextual control on the entire data set (see Fig. 3) checks compliance with conditions valid for the whole model (for example: is the maximum work amount allowed for the project not exceeded? do the reports of an employee cover the minimal workload of the employee in the time period?), and others

This paper emphasizes the role of an executable data quality model with three successive data quality controls. The informally described controls should be replaced by executable software routines. The syntactic control of the correctness of input fields may be implemented using external code libraries, so this topic will not be discussed further. The contextual control on interrelated data can be described by SQL statements (see Fig. 2b); of course, the statements will be individual for each usage or task. The contextual control on the entire data set may also be described using SQL statements, although the statements could be rather complicated (see Fig. 3b). Therefore, it is proposed to create individual data quality models for each usage instead of one universal data quality control solution. Because the data quality controls may be implemented by software routines or SQL statements, the data quality model is executable.

In the given example some executable data quality management activities are not explicitly defined, for instance, assigning values to variables in software or sending error messages (activity SendMessage) to users. These activities may be unique but are still quite trivial, and therefore they are not addressed in the following discussions.
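Since the SQL of Fig. 2b is not reproduced in the text, the following is a hedged sketch of what a contextual control on interrelated data might look like for the WTT example, assuming the hypothetical staging table Wt_inbox and the schema sketched earlier.

```sql
-- Contextual control on interrelated data (illustrative sketch):
-- reported working time must refer to an existing project/developer assignment,
-- and Wt_date must fall within both the assignment period and the project period.
SELECT i.*
FROM   Wt_inbox i
WHERE  NOT EXISTS (SELECT 1
                   FROM   Proj_Dev_time pd
                   WHERE  pd.Proj_ID = i.Proj_ID
                     AND  pd.Dev_ID  = i.Dev_ID
                     AND  i.Wt_date BETWEEN pd.Start_date AND pd.End_date)
   OR  NOT EXISTS (SELECT 1
                   FROM   Projects p
                   WHERE  p.Proj_ID = i.Proj_ID
                     AND  i.Wt_date BETWEEN p.Proj_start_date AND p.Proj_end_date);
```

Rows returned by such a query could be routed to a table of incorrect data or reported via a SendMessage-style activity, as described above.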

Fig. 3. (a) Contextual control on the entire data set; (b) Implementation of contextual control on the entire data set.
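Similarly, Fig. 3b is referenced but not reproduced here; under the same schema assumptions, the following sketch illustrates how a contextual control on the entire data set could be expressed for the two example conditions mentioned above (project limit exceeded, minimal monthly workload not covered).

```sql
-- Entire-data-set control 1 (illustrative sketch): projects whose reported
-- working time exceeds the maximum allowed work amount (Proj_limits).
SELECT w.Proj_ID, SUM(w.Wt_hours) AS reported_hours, p.Proj_limits
FROM   Work_time w
JOIN   Projects p ON p.Proj_ID = w.Proj_ID
GROUP  BY w.Proj_ID, p.Proj_limits
HAVING SUM(w.Wt_hours) > p.Proj_limits;

-- Entire-data-set control 2 (illustrative sketch): developers whose reports
-- for a given month do not cover their minimum monthly workload (Dev_load).
-- The month boundaries below are placeholders supplied by the data quality model.
SELECT d.Dev_ID, d.Dev_load, COALESCE(SUM(w.Wt_hours), 0) AS reported_hours
FROM   Developers d
LEFT JOIN Work_time w
       ON w.Dev_ID = d.Dev_ID
      AND w.Wt_date BETWEEN DATE '2016-12-01' AND DATE '2016-12-31'
GROUP  BY d.Dev_ID, d.Dev_load
HAVING COALESCE(SUM(w.Wt_hours), 0) < d.Dev_load;
```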

In addition, it should be noted that the three identified data quality control groups are typical not only for data warehouse ETL procedures but also for information systems using a centralized database. The syntactic control in such cases is ensured through screen forms for message input with built-in data type controls. The contextual control on interrelated data is performed online by comparing the values of input fields with the database values of interrelated objects. The contextual control on the entire data set is not useful during the input of one separate message, as, for instance, the overrun of a project's limits may be identified only after the input of many messages.

In conclusion, the data quality checks described above constitute a data quality model. Each component of this model can be executed an arbitrary number of times. Validations and their number can easily be changed; therefore the data quality requirements can be adapted to concrete needs.




4.3. Implementation of data quality models

The data quality model, like business processes, is described using a graphical DSL. Since the data quality requirements differ, the DSLs used can also differ. Hence it is advisable not to use only one specific graphical editor supporting one specific DSL but to create an own editor for each DSL used. Currently there are several such platforms in use; one of them, called DIMOD3, was derived from GrTP8. The business process modeling environment DIMOD is intended to:

• Define the DSL using a meta-model that is stored in the model repository. DSL parameters may be defined and modified using the separate configuration component Configurator9. Once the DSL is defined, the corresponding modeling editor is created for the DSL automatically
• Create and edit data quality diagrams in the DSL. This is usually done by highly qualified modeling experts in collaboration with domain experts ("clever users")
• Check the data quality models' internal consistency. Both IT and domain experts are involved in this
• Publish the created model/diagrams on the web, which allows a wide range of users to use the data quality model diagrams

Once the DSL is defined and an appropriate editor is created, a data quality model can be designed including the data quality checking parameters for the specific information system. Initially the model can be informal, e.g. the data validations could be described in a textual way. Afterwards the informal description is substituted by source code or SQL statements, making the informal model executable. In the next step, the formal data quality model may be translated into a software routine, or a universal interpreter can be created which is able to execute the data quality model.

5. Usage of data quality models

The theory of total data quality management10 describes the main principles of data quality and methods of its evaluation. This paper proposes an implementation mechanism for this theoretical methodology using executable data quality models; this was the main goal of the research.

In practice the proposed data quality checking technology is primarily suitable for information systems using relational databases. As the database structure (ER model) is relatively constant, the data quality model may be used in a very effective way: before the data is used, the data quality model is executed and all discrepancies with the requirements are identified. The user may then decide to improve the data quality or to use the data as it is.

The proposed technology can also be used for the control of input data in very complex screen forms with many interrelated or conditionally linked records. For example, the number of input fields in customs declaration documents may exceed 50 in one screen form. The created data quality model may serve as a precise specification for software development and as a testing model for the created software.

A similar situation arises when importing data into data warehouses. Since various mutually unrelated databases may serve as data sources, the compatibility of the data sources and their compliance with requirements should be checked before importing the data into the data warehouse. The data quality model lets compliance requirements be defined precisely and in detail.

The proposed technology is applicable not only to relational databases but also to document-oriented NoSQL databases with XML documents.
The main difference is that, in the case of a document-oriented database, the requirements will be described by XQuery statements (instead of SQL statements), and the data object attributes will not be addressed via the column names of the tables in the ER model.

6. Conclusion

Data quality management in this paper is seen as a dynamic process including gradual supplementation of the data set and repeated data quality control with syntactic and semantic validations. Data quality is controlled according to the data quality model, which is implemented as graphical diagrams in a DSL created specifically for data quality needs. Data quality validation steps (controls) are elements of the DSL, and


they are arranged in the sequence of their execution. Data quality controls describe conditions for checking data object attribute values. In addition, the controls can be joined in loops, making it possible to "scan" data object classes. The proposed approach makes it possible to solve data quality problems during the initial entry of data into the database as well as repeatedly, by controlling the completeness of the entered data or the timeliness of the input.

Acknowledgments

The research leading to these results has received funding from the research project "Competence Centre of Information and Communication Technologies" of EU Structural funds, contract No. 1.2.1.1/16/A/007 signed between IT Competence Centre and Central Finance and Contracting Agency, Research No. 1.8 "Data quality management by executable business process models".

References

1. ISO 9001:2015. Quality management principles. Available: http://www.iso.org/iso/pub100080.pdf.
2. OCL 2.0. Object Constraint Language™. Version 2.0. Release date: May 2006. Available: http://www.omg.org/spec/OCL/2.0/.
3. Bicevskis J, Bicevska Z. Business Process Models and Information System Usability. Procedia Computer Science. 77; 2015. p. 72–79.
4. Haan JD. 8 Reasons Why Model-Driven Approaches (will) Fail; 2008. Available: http://www.infoq.com/articles/8-reasons-whyMDE-fails.
5. Seidewitz E. Executable UML Roadmap; 2014. Available: http://www.slideshare.net/seidewitz/xumlpresentation-140917-executableuml-roadmap.
6. Lockheed Martin: Model Driven Solutions. Available: http://modeldriven.github.io/fUMLReferenceImplementation/.
7. Augustsons M. Monterey Phoenix, or How to Make Software Architecture Executable. Orlando, OOPSLA Companion; 2009. p. 1031–1038.
8. Barzdins J, Zarins A, Cerans K, Kalnins A, Rencis E, Lace L, Liepins R, Sprogis A. GrTP: Transformation Based Graphical Tool Building Platform; 2014. Available: http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-297/.
9. Sprogis A. Configuration Language for Domain Specific Modeling Tools and Its Implementation. Baltic J. Modern Computing. Vol. 2 (2); 2014. p. 56–74.
10. Wang RY. A Product Perspective on Total Data Quality Management. Available: https://pdfs.semanticscholar.org/6124/b6ede253fe73ca122bc744b4b74b2cb92f66.pdf.

Janis Bicevskis, born in 1944, Doctor of Computer Science since 1992 and Professor since 2002, has published more than 100 scientific papers and has participated in many software development projects (design, implementation, programming, testing, management) since 1968. He led the Department of Computer Science at the University of Latvia, was a co-author of the national program "Informatics", the project manager of the Latvia Education Information System in 1999, and the manager of the "Integration of National Information Systems (Megasystem)" project in 1998. His main current scientific interests are software engineering and software testing. Contact him at [email protected].

Zane Bicevska, born in 1971, earned her Doctor degree in Computer Science from the University of Latvia in 2010 and her Master degree in Economics from the Johann Wolfgang Goethe-Universität Frankfurt am Main in 2001. She has published more than 15 scientific papers and has worked as a project manager in various software development projects. Since 2010 Zane Bicevska has been Assistant Professor in the Computing Faculty of the University of Latvia. Her main scientific interests include project management, business process modelling, and smart technologies.
