Literature Review of Data Model Quality Metrics of Data Warehouse


Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 48 (2015) 236 – 243

International Conference on Intelligent Computing, Communication & Convergence (ICCC-2014) (ICCC-2015) Conference Organized by Interscience Institute of Management and Technology, Bhubaneswar, Odisha, India

Anjana Gosain (a), Heena (a)

(a) GGSIPU, Sector 16C Dwarka, New Delhi, India

Abstract

The quality of a data warehouse is crucial for managerial strategic decisions. Multidimensional data modeling has been accepted as the basis for data warehouses, so data model quality has a great impact on the overall quality of a data warehouse. Metrics act as a tool to measure the quality of a data warehouse model. Various authors have proposed metrics to assess quality attributes of conceptual data models for data warehouses, such as understandability and maintainability. This body of work motivates us to investigate the metrics proposed to measure data warehouse data model quality and the quality factors they assess, and to provide groundwork for further research in this field. A total of 22 studies were selected and analyzed to identify the validation techniques used to demonstrate the practical utility of the metrics and the quality factors measured by them. Opportunities for future work lie in the gaps found in the validation of the metrics and in the quality factors that remain unmeasured.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of scientific committee of International Conference on Computer, Communication and Convergence (ICCC 2015)

Keywords: Literature review, Data warehouse Data Model Quality Metrics, Conceptual Data model quality metrics, Understandability

1877-0509 © 2015 The Authors. Published by Elsevier B.V. doi:10.1016/j.procs.2015.04.176

1. Introduction

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process [1]. A lack of quality in the information provided by a data warehouse can, however, lead to bad strategic decisions. Information quality in a data warehouse therefore needs to be assured, and it in turn depends on presentation quality, data quality and data model quality (for both the physical and the logical model) [2]. It is widely accepted that data modeling for data warehouses is based on multidimensional modeling, and OLAP tools directly access the multidimensional schemas of data warehouses to help decision makers. Thus, the quality of the data model has great influence on the overall quality of the data warehouse.

Multidimensional schemas organize data in facts and dimensions. Facts contain the measures for analysis along the dimensions. Dimensions contain attributes, granularity levels and hierarchies for better representation of data. All these structural elements and the relationships between them affect the complexity of the schema, and this structural complexity notably determines data model quality. Earlier, in [21], the quality of data warehouse conceptual models was treated as an intuitive notion, but such intuitive notions need to be replaced by a set of quantitative measures to increase objectivity. A metric provides a way to measure a quality factor in a consistent and objective manner [2]. Various authors have proposed metrics to measure quality factors of conceptual data models for data warehouses, such as maintainability and understandability [6,8,10,14,15,18,22,24]. These metrics have been validated theoretically and empirically. In this paper we therefore present a systematic literature review (SLR) of data model quality metrics for data warehouses, in order to establish the current state of the art and explore opportunities for further research.

The remainder of the paper is organized as follows. Section 2 describes the steps taken to conduct the SLR for our research field. Section 3 summarizes the metrics proposed to assess the quality of conceptual data models of data warehouses. Section 4 discusses the findings of the SLR, covering empirical evidence, data warehouse quality factors and existing gaps in the research area. Section 5 concludes the work.

2. Systematic Literature Review (SLR) Process

An SLR is a form of secondary study that uses a well-defined methodology to identify, analyze and interpret all available evidence related to a specific research question in a way that is unbiased and (to a degree) repeatable [3]. Following the guidelines stated by Kitchenham [3], the series of actions taken to direct our research is shown in Fig 1 and detailed in the following subsections.

2.1 Research Questions

Research questions form an important part of an SLR, as they direct the research towards finding the appropriate studies, analyzing them and extracting the data needed to answer them. Our work intends to answer the following research questions:

RQ 1: What are the existing empirical evidences for various metrics proposed to measure conceptual data model quality for data warehouse?
RQ 2: What are the quality factors that have been related to proposed metrics?
RQ 3: What are the existing gaps in the current research in the use of quality metrics in data model of data warehouse?

2.2 Search Process

Sources of Information

The following 5 online databases were used to extract the required studies for review:
a) IEEE Xplore (http://ieeexplore.org/)
b) Springer LNCS (http://www.springer.com/gp/)
c) Science Direct (http://www.sciencedirect.com/)
d) ACM digital library (http://dl.acm.org/)
e) Google Scholar (http://scholar.google.co.in/)

An initial search string was needed to extract relevant studies from the databases. The search string defined for our work was:

(("data quality" OR "data warehouse quality" OR "multidimensional data quality" OR "data warehouse conceptual model") AND ("metrics" OR "measures"))
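The boolean structure of the search string can be made concrete with a minimal Python sketch. The function name `matches_search_string` and the sample titles are hypothetical, and plain case-insensitive substring matching is a simplifying assumption; each database applies its own query syntax and indexing, so this only illustrates the AND/OR logic.

```python
# Term groups taken from the search string in Section 2.2.
TOPIC_TERMS = (
    "data quality",
    "data warehouse quality",
    "multidimensional data quality",
    "data warehouse conceptual model",
)
MEASURE_TERMS = ("metrics", "measures")


def matches_search_string(text: str) -> bool:
    """Return True if text has at least one topic term AND one measure term.

    Assumption: case-insensitive substring matching; real search engines
    tokenize and rank differently.
    """
    lowered = text.lower()
    has_topic = any(term in lowered for term in TOPIC_TERMS)
    has_measure = any(term in lowered for term in MEASURE_TERMS)
    return has_topic and has_measure


# Hypothetical titles, used only to exercise the function.
sample_titles = [
    "Towards data warehouse quality metrics",
    "A survey of OLAP query optimization",
    "Data quality in sensor networks",
]
selected = [title for title in sample_titles if matches_search_string(title)]
```

Only the first sample title satisfies both groups; the third mentions a topic term but no measure term, which is why the AND clause narrows the result set.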


Our search was divided into two phases, primary and secondary search:
a) Primary search: This search was directed at finding studies from online databases, electronic journals, online search engines and conference proceedings. It covered work published from 2001 to 2014, since the first work on data warehouse data model quality metrics was carried out in 2001. The search string was applied to the full text while searching these databases. The selected works were assumed to be of high quality, as they must have gone through a strict review procedure to be published in the selected sources.
b) Secondary search: This search aimed at finding studies by reviewing the references and citations of the studies selected during the primary search. This kind of search is conducted to cover a wider range of papers and to reduce the risk of missing important literature related to our field.

Fig 1- SLR Process Steps

2.3 Inclusion and Exclusion Criteria

These criteria are needed to filter the studies relevant to our search field. We included studies that were:
1. Proposing and discussing metrics for data warehouse conceptual data model quality
2. Comparing different metrics based on different validation techniques
3. Theoretically, empirically or experimentally validating quality metrics for data models
4. Proving metrics to be quality indicators for various quality attributes

We excluded studies that were:
1. Not in English
2. Related only to data quality and not to data model quality
3. Published in posters, eBooks, editorials, news and magazines
4. Purely discussing conceptual models and not their metrics

2.4 Study Selection

All works that discussed the quality of conceptual data models of data warehouses and proposed metrics for the quality attributes affecting those data models were considered relevant. To extract studies from the sources of information based on our research questions, we conducted our primary search, which initially retrieved 18088 articles. First, papers were filtered on the basis of their titles. Papers related to physical and logical data models were excluded, leaving 453 studies. Around 100 papers that discussed only the conceptual model of a data warehouse, and not its metrics, were then left out. Next, we excluded irrelevant papers based on their abstracts, which left 65 studies. Finally, the full texts were read and 22 studies were found relevant for our research based on the inclusion and exclusion criteria. The secondary search was then conducted, but no new studies could be identified, as all were already in our primary search list.
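The selection funnel described in Section 2.4 can be summarized numerically. The following is a small sketch in Python that uses only the exact counts reported above; the step that removed "around 100" papers is left out because the text gives no exact figure for it, and the `retention` helper is a hypothetical name introduced for illustration.

```python
# Counts reported in Section 2.4. The "around 100" exclusion step is
# omitted because only an approximate figure is given for it.
stages = [
    ("retrieved by primary search", 18088),
    ("after title filtering", 453),
    ("after abstract screening", 65),
    ("after full-text reading", 22),
]


def retention(stage_list):
    """Return (label, count, share_of_previous_stage, share_of_initial)."""
    initial = stage_list[0][1]
    previous = initial
    rows = []
    for label, count in stage_list:
        rows.append((label, count, count / previous, count / initial))
        previous = count
    return rows


rows = retention(stages)
for label, count, step_share, total_share in rows:
    print(f"{label:30s} {count:6d} {step_share:8.2%} {total_share:8.3%}")
```

The last column shows how small the final set is relative to the initial retrieval, which is typical when a broad search string is applied to full text.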


3. Background Work

A summary of the metrics proposed, the validation techniques used and the quality factors assessed is given in Table 1.

Table 1 – Summary of proposed metrics for Data Warehouse Data Model Quality

| Authors | Metrics proposed/discussed | Publication Year | Theoretical validation | Empirical validation | Quality factor discussed |
|---|---|---|---|---|---|
| Piattini et al. [2] | Table metrics, Star metrics, Schema metrics | 2001 | Measurement theory based framework | Not done | None |
| Piattini et al. [5] | Schema metrics | 2002 | Done in [2] | Non-parametric correlational analysis using Kendall's Tau_b | Complexity |
| Si-Said, Prat N [9] | Legibility metrics, Expressiveness metrics, Simplicity metric, Correctness metric | 2003 | Not done | Not done | Legibility, Expressiveness, Simplicity, Correctness |
| Piattini et al. [6] | NFT, NDT | 2003 | Done in [2] | Parametric test: F-statistic | Understandability |
| Piattini et al. [8] | Class scope metrics, Star scope metrics, Diagram scope metrics | 2004 | Not done | Non-parametric correlational analysis using Spearman's Rho statistic | Maintainability, Understanding time |
| Piattini et al. [7] | Diagram level, Package level | 2005 | Axiomatic and measurement based theory | Not done | Complexity |
| Piattini et al. [10] | Star scope metrics proposed in [8] | 2007 | DISTANCE framework | Non-parametric correlational analysis using Spearman's Rho statistic | Understandability, Effectiveness, Efficiency |
| Piattini et al. [11] | Schema metrics (NFT(Sc), NDT(Sc), NFK(Sc), NMFT(Sc)) proposed in [2] | 2008 | Axiomatic and measurement based theory and DISTANCE framework | Spearman's Rho correlational analysis, Multivariate Linear Regression, Case Based Reasoning, Formal Concept Analysis, Bayesian Classifiers | Understandability |
| Gosain A, Gupta R [12] | Schema metrics proposed in [2] | 2010 | Done in [2] | Stepwise Linear Regression | None |
| Gosain A, Nagpal S, Sabharwal S [14] | Schema metrics (NFT(Sc), NDT(Sc), NFK(Sc), NMFT(Sc)) proposed in [2] | 2011 | Done in [2] | Parametric tests: Descriptive statistics, Kolmogorov-Smirnov, Pearson correlation, Univariate Regression analysis, Multiple Regression | Understandability |
| Gosain A, Nagpal S, Sabharwal S [15] | Dimension Hierarchy metrics | 2011 | Not done | Not done | Maintainability, Understandability |
| Gupta J, Gosain A, Nagpal S [23] | Diagram level and Package level metrics proposed in [7] | 2011 | Done in [7] | Correlation and regression analysis | None |
| Gosain A, Gupta R [13] | Schema metrics proposed in [2] | 2012 | Done in [2] | Principal Component Analysis | None |
| Gosain A, Nagpal S, Sabharwal S [22] | Schema metrics (NFT(Sc), NDT(Sc), NFK(Sc), NMFT(Sc)) proposed in [2] | 2012 | Done in [2] | Fuzzy Logic | Understandability |
| Gosain A, Nagpal S, Sabharwal S [16] | Dimension Hierarchy metrics proposed in [15] | 2012 | Axiomatic approach | Parametric test: Pearson correlation; Non-parametric tests: Spearman's Rho, Kendall's Tau_b; Linear Regression | Understandability |
| Gosain A, Nagpal S, Sabharwal S [17] | Complexity metrics Scomp, Dcomp | 2012 | Not done | Not done | None |
| Ali K.B, Gosain A [26] | Star scope metrics (NDC(S), NBC(S), NC(S), NA(S)) proposed in [8] | 2012 | Done in [10] | Decision Tree approach | None |
| Ali K.B, Gosain A [27] | Star scope metrics (NDC(S), NBC(S), NC(S), NA(S)) proposed in [8] | 2012 | Done in [10] | Fuzzy Logic | None |
| Gosain A, Nagpal S, Sabharwal S [18] | Complexity metrics Scomp, Dcomp proposed in [17] | 2013 | Axiomatic approach | Parametric test: Pearson correlation; Non-parametric tests: Spearman's Rho, Kendall's Tau_b; Ordinal Regression | Understandability |
| Gosain A, Mann S [24] | Star scope metrics proposed in [8] | 2014 | Done in [10] | Multiple Linear Regression | Understandability, Efficiency |
| Kumar M, Gosain A, Singh Y [25] | Star scope metrics proposed in [8] | 2014 | Done in [10] | Univariate and multivariate regression analysis; machine learning methods: Decision Trees, Naive Bayesian Classifier | Understandability |
| Gosain A, Singh J [28] | Metrics on Dimension Hierarchy Sharing | 2014 | Not done | Not done | Complexity |
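Table 1 refers to the schema-level metrics NFT(Sc), NDT(Sc), NFK(Sc) and NMFT(Sc) by name only. As a minimal sketch, and assuming the usual reading of these names in the cited schema-metric papers (number of fact tables, number of dimension tables, number of foreign keys in the fact tables, and number of measures in the fact tables), they could be computed from a simple in-memory description of a star schema as follows. The `FactTable` and `Schema` classes and the sample sales schema are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FactTable:
    name: str
    measures: List[str] = field(default_factory=list)      # numeric facts
    foreign_keys: List[str] = field(default_factory=list)  # referenced dimensions


@dataclass
class Schema:
    fact_tables: List[FactTable]
    dimension_tables: List[str]


def schema_metrics(schema: Schema) -> Dict[str, int]:
    """Count structural elements of a schema.

    Assumed definitions (not spelled out in the review itself):
    NFT  = number of fact tables
    NDT  = number of dimension tables
    NFK  = number of foreign keys over all fact tables
    NMFT = number of measures over all fact tables
    """
    return {
        "NFT": len(schema.fact_tables),
        "NDT": len(schema.dimension_tables),
        "NFK": sum(len(f.foreign_keys) for f in schema.fact_tables),
        "NMFT": sum(len(f.measures) for f in schema.fact_tables),
    }


# Hypothetical star schema: one fact table linked to three dimensions.
sales = FactTable(
    name="sales",
    measures=["quantity", "amount"],
    foreign_keys=["date", "product", "store"],
)
schema = Schema(fact_tables=[sales], dimension_tables=["date", "product", "store"])
metrics = schema_metrics(schema)
```

Because each value is a plain count of structural elements, the metrics can be collected for many schemas and then related to measured outcomes such as understanding time, which is what the empirical validation studies in Table 1 do.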

4. Literature Review Results

The total number of relevant studies identified after searching the online databases was 22. The distribution of these studies shows that more than half of the papers were published in journals. The majority of the remainder were published in conference proceedings, and very few in workshops.

Fig 2- Distribution of papers

4.1 Research Question 1- Empirical Evidences for Various Proposed Metrics

To ensure that proposed metrics are reliable and of practical use, they need to be validated. Metrics can be validated in two ways, theoretically and empirically [4]. This research question aims to identify the metrics proposed for data warehouse data model quality and how their authors validated them to demonstrate correctness and utility. Various authors have theoretically validated their metrics using the measurement theory framework [2,7,11], the axiomatic approach [7,11,16,18] and the DISTANCE framework [10,11].

Fig 3 a) Article distribution by theoretical validation technique b) Advent of empirical validation techniques

Fig 4 a) Distribution of papers by non-parametric test technique b) Distribution of papers by parametric test technique

Fig 5 a) Distribution of papers by regression technique b) Distribution of papers by non-parametric test technique

The empirical validation techniques used for multidimensional model quality metrics fall broadly into 4 groups: non-parametric correlational analysis, parametric analysis, regression analysis and machine learning. Piattini et al. [5] started the empirical validation of multidimensional model metrics using non-parametric correlational analysis in 2002. Various research works then used this analysis technique with different statistics, such as Kendall's Tau_b [5,16,18] and Spearman's Rho [8,10,11,16,18]. In 2003 a parametric test was first used, based on the F-statistic [6]; other methods used included descriptive statistics [14], Kolmogorov-Smirnov [14] and Pearson correlation [14,16,18]. Finally, in 2008, other techniques such as regression analysis [11,12,13,14,16,18,23,24,25] and machine learning [11,22,25,26,27] started to be used to empirically validate these metrics. The number of papers using each of these empirical techniques is shown graphically in Fig 4 and Fig 5.

4.2 Research Question 2- Quality factors assessed by proposed metrics

This question was aimed at finding the quality factors that have been discussed (investigated or considered) for data warehouse data models. Designers need selection criteria to choose the best dimensional model for their requirements, and these criteria are based on quality factors. The most common quality factor is understandability, discussed by 10 out of 22 studies [6,8,10,14,15,16,18,22,24,25]. Other factors considered are simplicity, legibility, expressiveness, correctness, efficiency and effectiveness [9,10]. One of the studies proposed metrics to quantify a multidimensional schema for simplicity, expressiveness, correctness and legibility, but could not prove their validity [9]. Two other studies validated metrics for effectiveness [10] and efficiency [10,24]; the proposed metrics proved useful for measuring efficiency only and did not quantify effectiveness [10]. The remaining studies did not focus on any particular quality attribute; rather, they proposed metrics and proved their utility simply as quality indicators.
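As an illustration of the non-parametric correlational analysis discussed in Section 4.1, the following Python sketch computes Spearman's Rho between a structural metric and a measured outcome. It is a generic textbook formulation (Pearson correlation applied to average ranks), not the procedure of any particular reviewed study, and the data values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: a metric value per schema and the time (in seconds)
# that subjects needed to understand that schema.
metric_values = [2, 3, 3, 5, 6, 8]
understanding_time = [40, 55, 50, 70, 90, 85]
rho = spearman_rho(metric_values, understanding_time)
```

A value of `rho` close to 1 would suggest that schemas with higher metric values tend to take longer to understand; a significance test would still be needed before treating the metric as a quality indicator.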

Fig 6- Distribution of papers by Quality Factor

4.3 Research Question 3- Existing Gaps in the research

The ISO/IEC 9126 standard [20] considers that the quality of software can be evaluated through six characteristics: functionality, usability, reliability, efficiency, portability and maintainability. As validated in [29], the quality factors that can be considered for a data model include completeness, correctness, integrity, simplicity, flexibility, integration, understandability and implementability. However, in most of the research only maintainability and one of its sub-characteristics, understandability, have been the main quality concern for data models of data warehouses. The other factors are considered equally important for software, so they also need to be discussed in the context of data model quality in data warehouses. Metrics need to be validated both theoretically and empirically, but the metrics measuring effectiveness, simplicity, legibility, expressiveness and correctness discussed in [9,10] have not been validated to prove their utility in real environments.

5. Conclusion

In this paper we conducted a systematic literature review to present a summarized view of the research carried out in the area of quality metrics for conceptual models of data warehouses. 22 studies were identified as relevant and were categorized on the basis of the validation techniques applied to the metrics and the quality factors assessed by them. The results indicate that various empirical validation techniques, such as parametric and non-parametric tests, regression analysis and machine learning techniques, have been used to validate metrics. The majority of the research work focused on the complexity and understandability quality factors to measure the quality of conceptual models of data warehouses. The existing gaps in this research lie in the missing validation of various metrics and in the quality factors that have not yet been assessed in the context of conceptual data models of data warehouses.

References

1. Inmon W. Building the Data Warehouse; 1991. p. 25
2. Calero C, Piattini M, Pascual C, Serrano M.A. Towards Data warehouse quality metrics. In 3rd International workshop on design and Management of Data warehouses (DMDW 2001), Interlaken, Switzerland; 2001
3. Kitchenham B, Brereton O.P, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering – a systematic literature review, Journal of Information and Software Technology, Vol. 51, No. 1. 2009; p. 7-15.
4. Calero C, Piattini M, Genero M. Method for obtaining correct metrics. In 3rd International Conference on Enterprise and Information Systems (ICEIS`2001); 2001.
5. Serrano M, Calero C, Piattini M. Validating metrics for data warehouses, IEE Proceedings SOFTWARE 149; 2002. p. 161–166.
6. Serrano M, Calero C, Piattini M. Experimental validation of multidimensional data models metrics, Proc of the Hawaii International Conference on System Sciences (HICSS'36), IEEE Computer Society; 2003
7. Berenguer G, Romero R, Trujillo J, Serrano M, Piattini M. A Set of Quality Indicators and Their Corresponding Metrics for Conceptual Models of Data Warehouses, Lecture Notes in Computer Science 3589 (DaWaK 2005), Springer. ISSN: 0302-9743. 2005; p. 95-104.
8. Serrano M, Calero C, Trujillo J, Lujan, Piattini M. Empirical validation of metrics for conceptual models of data warehouse, 16th International Conference on Advanced Information Systems Engineering (CAISE'04), Riga, Latvia; 2004. p. 506–520.
9. Si-Saïd, Prat N. Multidimensional Schemas Quality: assessing and Balancing Analyzability and simplicity, ER 2003 Workshop; 2003. p. 140-151
10. Serrano M, Trujillo J, Calero C, Piattini M. Metrics for data warehouse conceptual models understandability, Journal of Information and Software Technology 49; 2007. p. 851-870.
11. Serrano M.A, Calero C, Sahraouli H, Piattini M. Empirical studies to assess the understandability of data warehouse schemas using structural metrics, Softw. Qual. J.; 2008, 16, (1), p. 79–106
12. Gupta R, Gosain A. Analysis of data warehouse quality metrics using LR, International conference on Information and Communication Technologies; 2010
13. Gupta R, Gosain A. Validating data warehouse quality metrics using PCA. In: Kannan, R., Andres, F. (eds.) ICDEM 2010. LNCS, vol. 6411, Springer, Heidelberg; 2012. p. 170-172
14. Gosain A, Sabharwal S, Nagpal S. Assessment of quality of data warehouse multidimensional model, Int. J. Inf. Qual.; 2011, 2(4), p. 344–358
15. Gosain A, Nagpal S, Sabharwal S. Quality metrics for conceptual models for data warehouse focusing on dimension hierarchies, ACM SIGSOFT Softw. Eng. Notes, 2011, 36, (4), p. 1–5
16. Gosain A, Nagpal S, Sabharwal S. Validating dimension hierarchy metrics for the understandability of multidimensional models for data warehouse. IET Software 7(2); 2013. p. 93–103
17. Nagpal S, Gosain A, Sabharwal S. Complexity metric for multidimensional model for data warehouse. International Information Technology Conference and Exhibition, Pune; 2012
18. Nagpal S, Gosain A, Sabharwal S. Theoretical and empirical validation of comprehensive complexity metric for multidimensional models for data warehouse. International Journal of System Assurance Engineering & Management; 2013. p. 193–204
19. Briand L.C, Morasca S, Basili V.R. Property based software engineering measurement, IEEE Trans. Softw. Eng.; 1996. p. 68–86
20. ISO/IEC Standard 9126 Software product evaluation – quality characteristics and guidelines for their use; 1999.
21. Jarke M, Lenzerini M, Vassiliou Y, Vassiladis P. Fundamentals of Data Warehouses. Springer Verlag; 2002
22. Gosain A, Sabharwal S, Nagpal S. Predicting quality of data warehouse using fuzzy logic, Int. J. Business and Systems Research, Vol 6, No 3; 2012. p. 255-268
23. Gupta J, Gosain A, Nagpal S. Empirical validation of object oriented data warehouse design quality metrics. ACITY, CCIS 198, Springer-Verlag Berlin Heidelberg; 2011. p. 320-329
24. Gosain A, Mann S. Empirical validation of metrics for object oriented multidimensional model for data warehouse. International Journal of System Assurance Engineering & Management, Springer, Vol. 5, Issue 3; 2013. p. 262-275
25. Kumar M, Gosain A, Singh Y. Empirical validation of structural metrics for predicting understandability of conceptual schemas for data warehouse. International Journal of System Assurance Engineering & Management, Springer, Vol 5, Issue 3; 2013. p. 291-306
26. Ali K.B, Gosain A. Predicting the Quality of Object-Oriented Multidimensional (OOMD) Model of Data Warehouse using Decision Tree Technique, IJSER, Volume 3, Issue 8; 2012. p. 816-820
27. Ali K.B, Gosain A. Predicting the quality of object oriented multidimensional (OOMD) model of data warehouse using fuzzy logic technique. International Journal of Engineering Science & Advanced Technology, Volume 2, Issue 4; 2012. p. 1048-1054.
28. Gosain A, Singh J. Quality Metrics for Data Warehouse Multidimensional Models with Focus on Dimension Hierarchy Sharing. Advances in Intelligent Informatics, Springer; 2014. p. 429-443
29. Moody D.L, Shanks G.G. Improving the quality of data models: empirical validation of a quality management framework, Information Systems, v.28 n.6, September 2003. p. 619-650