Open data: Quality over quantity

Research Note

Shazia Sadiq (School of Information Technology and Electrical Engineering, The University of Queensland, St Lucia, QLD 4072, Australia)
Marta Indulska (UQ Business School, The University of Queensland, St Lucia, QLD 4072, Australia)

Article history: Received 22 December 2016; Accepted 8 January 2017
Keywords: Open data; Data quality

Abstract

Open data aims to unlock the innovation potential of businesses, governments, and entrepreneurs, yet it also harbours significant challenges for its effective use. While numerous innovation successes are based on the open data paradigm, there is uncertainty over the quality of such datasets, and this uncertainty is a threat to the value that can be generated from the data. Data quality has been studied extensively over many decades and many approaches to data quality management have been proposed. However, these approaches typically target datasets internal to organizations, with known metadata and domain knowledge of the data semantics. Open data, on the other hand, is often unfamiliar to the user and may lack metadata. The aim of this research note is to outline the challenges in dealing with the data quality of open datasets, and to set an agenda for future research to address this risk to deriving value from open data investments.

1. Introduction

Open data is data made freely available by governments, organizations, researchers, and others, for use by anyone without copyright restrictions. The open data movement has grown substantially. Our review of open data availability indicates that the Australian government open data portal grew by over 900% between 2013 and 2015 in terms of the number of available datasets (from 573 in December 2013 to 5767 in December 2015). Other governments have similar initiatives: New Zealand has over 3800 datasets available on its open data government portal, the UK over 23,000, the USA over 194,000, and Canada over 240,000 (see the respective data.gov portals). These figures do not include the large numbers of organizational and other datasets made available by non-government sources, e.g. GeoNames, Wikidata and DBpedia, to name a few. The proliferation of publicly available datasets (Duus & Cooray, 2016) and the emergence of data markets (Elbaz, 2012) present an unprecedented opportunity for governments, businesses and entrepreneurs to harness the power of data for economic, social and scientific gains. Open data is envisaged to form the basis of innovation.

Indeed, data-driven innovation is estimated to have added $67 billion in new value to the Australian economy, that is, 4.4 percent of Australian GDP (PWC, 2014). A number of competitions (hackathons), such as the Australian GovHack or the annual Open Data Day in the USA, have been introduced to mobilize public interest and to build a culture of data-driven innovation for economic and societal gains.

While open data competitions have given rise to some success stories in terms of start-ups and apps, well documented in published case studies on government open data portals (e.g. Queensland Government, 2016), there is also evidence that the time-to-value from these datasets remains prohibitively long, primarily due to a lack of knowledge about the quality characteristics of the data and the resulting effort of making the data ready for use (Belkin & Patil, 2016). At the same time, the metadata, as well as the underlying data quality, of these datasets is known to be deficient. For example, many open datasets have duplicate, inconsistent, and missing data, and generally lack easily accessible schema descriptions; the MusicBrainz.org open dataset, for instance, consists of 324 schema-less CSV files with a data volume of 35.1 GB. An analysis of open datasets (Zhang, Jayawardene, Indulska, Sadiq, & Zhou, 2014) indicates that many such problems exist. For example, in public transport data, the consistency of bus stop names is low, which has serious implications for uses of the data that require grouping or searching on bus stop names, such as timetabling and traffic monitoring. Similarly, Fig. 1 shows several data quality problems identified in the USA Gun Offenders Database.

Fig. 1. Data quality problems identified in an open dataset.
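Problems of the kind just described (duplicate records, missing values, inconsistently spelled names) can be surfaced with a few lines of exploratory code before any serious analysis is attempted. The sketch below is illustrative only: the file bus_stops.csv and the column stop_name are hypothetical stand-ins for the public transport example above, and the checks are deliberately simplistic.

```python
import csv
from collections import Counter

PATH = "bus_stops.csv"   # hypothetical open transport dataset
NAME_COL = "stop_name"   # assumed column holding bus stop names

with open(PATH, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
total = len(rows)

# 1. Missing data: rate of empty values per column.
missing = Counter()
for row in rows:
    for col, val in row.items():
        if val is None or val.strip() == "":
            missing[col] += 1
print("Missing-value rate per column:")
for col, n in missing.most_common():
    print(f"  {col}: {n / total:.1%}")

# 2. Duplicates: identical full records.
seen = Counter(tuple(sorted(r.items())) for r in rows)
print("Duplicate records:", sum(n - 1 for n in seen.values() if n > 1))

# 3. Inconsistent labels: the same stop name spelled several ways,
#    differing only in case, whitespace or punctuation.
def normalise(s: str) -> str:
    return "".join(ch for ch in s.lower() if ch.isalnum())

variants = {}
for row in rows:
    name = (row.get(NAME_COL) or "").strip()
    if name:
        variants.setdefault(normalise(name), set()).add(name)
for forms in variants.values():
    if len(forms) > 1:
        print("Inconsistent spellings:", sorted(forms))
```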

We note that the value of a dataset is inescapably tied to the underlying quality of the data (Johnston & Carrico, 1988; O'Reilly, 1982). Although value and quality may be correlated for open data, they are conceptually different (Abiteboul et al., 2015). For example, a complete and accurate list of the names of all countries in Asia may have little value, whereas incomplete and noisy GPS data from public transport vehicles may have high perceived value for transport engineers and urban planners. When dealing with such large and unknown datasets, a user might endure long query processing times only to realize that the results obtained are of poor quality. Alternatively, the user may not realise the data is of inadequate quality, thus affecting any subsequent decisions made based on the query result (Yeganeh, Sadiq, & Sharaf, 2014).

Despite such issues, there is an increasing tendency to gather significant volumes of external and internal data into so-called data lakes (Stamford, 2014), typically described as enterprise data management platforms for storing, curating and analysing data that comes from a number of disparate sources, including open data sources. Although there is heightened interest in the big data phenomenon, lessons learnt from years of research on information system use have shown that the assumption that 'more use is better' clearly does not hold (Seddon, 1997). As the number of open datasets and sources continues to grow at an exponential rate, data consumers are left with a massive footprint of unexplored, unfamiliar datasets that may or may not generate valuable insights for them. Organizations are thus starting to face the 'dark data' (Tittel, 2014) syndrome, where a large proportion of their information assets are under-utilized. Without scientifically credible knowledge that provides the ability to efficiently evaluate the underlying quality characteristics of the data, there is a significant risk of organizations and governments accumulating large volumes of low value density data (Curry, 2010), falling into analytical traps (Silver, 2012) and/or investing in low-ROI data products (Belkin & Patil, 2016).

On February 8th, 2015, a group of global thought leaders from the database research community outlined some grand challenges in getting value from big data (Abiteboul et al., 2015).

The key message was the need to develop the capacity to 'understand how the quality of data affects the quality of the insight we derive from it'. Given that the social, economic and scientific benefits that justify the global investments in open datasets are still in their infancy, there is a need for an in-depth understanding of the 'quality-to-use' dynamics of open data. In this paper we first outline the state of the art in evaluating data quality and highlight the challenges in applying these techniques to datasets that exhibit characteristics typical of the open data space. We then reflect on how these challenges undermine the ability to generate value from open data use, and present an agenda for future research to enable the requisite understanding of the 'quality-to-use' dynamics of open data.

2. Evaluating data quality

Data quality has been studied widely by both the research and practitioner communities (Sadiq, 2013). Data quality dimensions (Jayawardene, Sadiq, & Indulska, 2013) such as accuracy, completeness and consistency are a fundamental notion in the definition and measurement of data quality, and evaluating the quality of a dataset is a fundamental task in most, if not all, data quality management projects (Batini, Cappiello, Francalanci, & Maurino, 2009). The quality of data is typically evaluated against a stated requirement (English, 2009; ISO, 2011; Loshin, 2001); the last 20 years of data quality research (Sadiq, Yeganeh, & Indulska, 2011) have been based on this fundamental principle of fitness for use (Juran et al., 1974). Existing methodologies for data quality management are thus inevitably top-down (McGilvray, 2008; Redman & Blanton, 1997): data quality requirements are derived from well-understood usage requirements and enforced through good data governance practices. Batini et al. (2009) provide a comprehensive analysis of existing approaches to data quality assessment and requirements identification, indicating that such approaches typically include three core aspects: data and process analysis, data quality requirements analysis, and data quality analysis.

Data and process analysis includes the examination of data schemas, interviews, and meetings with data users, to reach a complete understanding of the data, the related constraints and rules, and the processes creating or consuming the data. Data quality requirements analysis often includes surveys of data users and administrators to identify quality issues, with the aim of identifying critical datasets, defining data quality metrics, and setting quality targets. Data quality analysis then pertains to activities related to exploring, assessing and profiling datasets against the defined data quality metrics.

Notable contributions to data quality assessment and requirements identification include Lee, Strong, Kahn, and Wang (2002), who propose a data quality assessment and improvement methodology consisting of three components: the PSP/IQ model (Product and Service Performance model for Information Quality), an Information Quality Assessment (IQA) instrument, and Information Quality (IQ) Gap Analysis Techniques. The assessment of information quality is conducted through a user survey. Similarly, Naumann and Rolker (2000) present a classification of IQ criteria based on the source of the IQ score, namely the perceptions of the users, the data source, and the query process of assessing the information. The assessment methods are subject to individual users' experiences and understanding of certain criteria. For example, the criteria 'interpretability' and 'concise representation' are both assigned the assessment method of 'user sampling'; while concise representation is constrained by business rules in certain application contexts, the degree of interpretability of data is subject to the individual user's perception.

It is evident that most, if not all, of these approaches follow a user-centred, top-down process, where requirements are solicited from users before the data is explored. Such methodologies cover a range of dimensions, but are bound to organizational settings and the data governance landscape of a given company, rendering them ineffective for the evaluation of external, unfamiliar datasets. In the current data landscape, users are confronted with new, unexplored, potentially large datasets that arguably have relevance and perceived value for business. In such scenarios, applying top-down approaches is not feasible. Users need to be empowered with exploratory capabilities that allow them to investigate the quality of the datasets and, subsequently, the implications of their use.

There are two existing areas in which bottom-up methods have been considered for data quality assessment: data exploration and data profiling. Data exploration has been well researched over the past decade (Dasu & Johnson, 2003); statistical methods are used to reveal facts about the data, these facts are used to formulate quality criteria and thereby evaluate quality, and data cleansing activities follow to improve quality. Dasu and Johnson (2003) provide a comprehensive list of existing statistical methods for data exploration. Even though the authors emphasize the possibility of using these methods for data quality problem detection, there is a lack of methodology or guidelines for conducting such an exploration of an arbitrary dataset. A minimal illustration of what such requirements-free exploration can look like follows below.
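The sketch below is illustrative only: it computes per-column statistics of the kind catalogued by Dasu and Johnson (2003) for an arbitrary delimited file, with no prior quality requirements. The file name is a hypothetical placeholder and the thresholds are arbitrary.

```python
import pandas as pd

# Hypothetical file; any delimited open dataset could be substituted.
df = pd.read_csv("unfamiliar_open_dataset.csv")

# Per-column profile built purely bottom-up, with no stated requirements.
profile = pd.DataFrame({
    "inferred_type": df.dtypes.astype(str),
    "null_rate":     df.isna().mean().round(3),
    "distinct":      df.nunique(),
    "distinct_rate": (df.nunique() / len(df)).round(3),
})
print(profile)

# Columns whose values are (almost) all distinct are candidate keys;
# near-constant columns carry little information. Both observations can
# seed quality criteria that no user requirement has stated up front.
print("Candidate key columns:", list(profile[profile["distinct_rate"] > 0.99].index))
print("Near-constant columns:", list(profile[profile["distinct"] <= 1].index))

# Summary statistics surface outliers and glitches in numeric columns.
print(df.describe())
```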
Data profiling is a related concept to data exploration and has a significant commercial tool market. Gartner (Friedman, 2013) estimated that this market reached $960 million in revenue by the end of 2012. Approximately 50% of the market is dominated by a few large and well-established vendors, such as IBM, Informatica, Pitney Bowes, SAP and SAS. The remaining 50% is divided among a large number of providers, including Microsoft, Oracle, Talend, Ataccama, Human Inference and Experian QAS, to name a few. These profiling tools offer a wide range of capabilities, including statistical distribution analysis of data, data redundancy checks, detection of data glitches, functional dependency analysis, column correlation analysis, and validity checks. Such tools are, however, generally not accompanied by guidelines on how the profiling reports can be used to identify actionable data quality requirements. To make one of these capabilities concrete, a naive functional dependency check is sketched below.
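This sketch tests whether one column functionally determines another, the kind of dependency a profiler would report; commercial tools and the constraint-discovery techniques surveyed by Fan and Geerts (2012) are far more sophisticated. The file and column names are hypothetical, reusing the bus-stop example from the introduction.

```python
import csv
from collections import defaultdict

def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds,
    i.e. every value of column lhs maps to exactly one rhs value."""
    mapping = defaultdict(set)
    for row in rows:
        mapping[row[lhs]].add(row[rhs])
    violations = {k: v for k, v in mapping.items() if len(v) > 1}
    return len(violations) == 0, violations

# Hypothetical question: does a stop identifier uniquely determine its name?
with open("bus_stops.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

ok, violations = fd_holds(rows, "stop_id", "stop_name")
print("stop_id -> stop_name holds:", ok)
for key, names in list(violations.items())[:5]:
    print(f"  stop_id {key!r} maps to {len(names)} names: {sorted(names)}")
```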

While there have been several contributions towards measuring data quality against specific dimensions, through data quality profiling (Abedjan, Golab, & Naumann, 2015), statistical approaches (Dasu & Johnson, 2003), and work on assessing data quality through the discovery of data dependency constraints (Fan & Geerts, 2012), these solutions are specialized towards specific dimensions (such as consistency or freshness) and are, alone, inadequate to capture an accurate and complete picture of overall data quality, which can span a large number of dimensions (Jayawardene et al., 2013). Additionally, these solutions are generally underpinned by assumptions about the availability of certain metadata (e.g., data distributions (Dasu & Johnson, 2003), thresholds (Song & Chen, 2011) and probabilities (Köhler, Link, & Zhou, 2015)), which may not be readily available for open datasets.

3. The need for change

The specific setting of open data creation, access and use disables a number of previously successful approaches for evaluating and effectively using data for business outcomes. At the same time, the old adage of 'garbage in, garbage out' still presents a significant risk of adverse effects, or at least prohibitive delays, in the effective use of open data for innovation and productivity gains. We posit that attention needs to be directed to three key areas of research if the value proposition of open data for the information society is to be realized.

3.1. Shared understanding of data quality dimensions

Several studies have recently analysed the data quality of selected open datasets and have reported issues similar to those outlined above, albeit with a different set of data quality dimensions and metrics (Rekatsinas, Dong, Getoor, & Srivastava, 2015). A prelude to the quality evaluation of data whose use context is largely unknown is the ability to declare the data quality dimensions to be evaluated in a generic manner. Although the concept of a data quality dimension is quite fundamental, there is evidence that overlaps and contradictions have proliferated in these fundamental definitions over decades of data quality research. These contradictions present a significant roadblock to reasoning about data quality dimensions at a generic level. Jayawardene et al. (2013) have consolidated the definitions from a large corpus of academic, practitioner and industry sources into a repository of 33 patterns of data quality (Sadiq, Jayawardene, & Indulska, 2015), supported by an extensive repository of use cases and examples sourced from academic and industry literature, and validated for completeness and applicability.

Although the synthesised data quality dimensions (Jayawardene et al., 2013) present a consolidated view of research and practice in data quality over the last two decades, the development of shared understanding across a large and diverse community of data providers and consumers remains a significant undertaking. This lack of a shared understanding of how to define and reason about data quality is a prohibitive factor in exploiting synergies, resulting in piecemeal and siloed activity within the open data community. Further, our initial research efforts indicate a lack of knowledge on the scale and impact of data quality problems across datasets from various international open data portals. Accordingly, we see a need for a global study, using a consistent comparison benchmark, to establish the extent of the problem before efforts are made to address it. A sketch of what such a shared, executable benchmark could look like follows below.
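The registry below is a toy illustration of one possible direction: dimension names paired with executable measures, so that datasets from different portals can be scored against the same definitions. The two measures are simplistic placeholders, not the validated patterns of Jayawardene et al. (2013), and all names and data are made up.

```python
import pandas as pd

# A shared registry: each dimension name is bound to one agreed measure,
# so every portal is scored against the same definition.
DIMENSIONS = {
    "completeness": lambda df: 1 - df.isna().values.mean(),  # non-missing cells
    "uniqueness":   lambda df: 1 - df.duplicated().mean(),   # non-duplicate rows
}

def benchmark(datasets):
    """Score every dataset on every registered dimension."""
    return pd.DataFrame({
        name: {dim: round(fn(df), 3) for dim, fn in DIMENSIONS.items()}
        for name, df in datasets.items()
    })

# Usage with two stand-in datasets: portal_a has a duplicate record,
# portal_b has a missing value.
portals = {
    "portal_a": pd.DataFrame({"id": [1, 2, 2], "name": ["x", "y", "y"]}),
    "portal_b": pd.DataFrame({"id": [1, 2, 3], "name": ["x", None, "z"]}),
}
print(benchmark(portals))
```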
3.2. Support for quality awareness

One of the biggest risks related to the use of open data is a lack of awareness of the quality inherent in the data. Open data is often used for a purpose that was not intended at the time of data collection, so a dataset that is of sufficient quality for one purpose may not be fit for another. Consumers of open data are typically not its producers, and hence there is no well-defined strategy for data cleansing, which often results in misguided data curation and transformation (Arocena et al., 2016). Open data consumers may thus invest significant effort in generating results from the data only to realize that the results are inadequate, or, worse, may not realise the data is of poor quality and rely on erroneous results. We see a critical need for exploratory tools and approaches that allow users to become aware of a dataset's shortcomings with respect to their intended use. Some efforts have been made to develop quality-aware query systems (Yeganeh et al., 2014), exploratory and visual methods (Ehsan, Sharaf, & Chrysanthis, 2016), and various methods to understand data and schema properties (Kruse, Papenbrock, Harmouch, & Naumann, 2016), as the sketch below illustrates in miniature. However, many open challenges remain for both technical and empirical researchers before adequate support for quality awareness can be provided to users.
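The following sketch answers a query and returns quality indicators for the attributes the query relied on. It is a toy take on the idea behind quality-aware query systems, not the design of the cited systems; all data and names are hypothetical.

```python
import pandas as pd

def quality_aware_query(df, mask, columns):
    """Return query results together with quality indicators for the
    attributes the query relies on, so the user can judge the answer."""
    result = df.loc[mask, columns]
    indicators = {
        col: {
            "null_rate_in_source": round(df[col].isna().mean(), 3),
            "null_rate_in_result": round(result[col].isna().mean(), 3),
        }
        for col in columns
    }
    return result, indicators

# Hypothetical example: querying an open dataset of inspection scores.
df = pd.DataFrame({
    "suburb": ["St Lucia", "St Lucia", None, "Toowong"],
    "score":  [4.0, None, 3.5, 5.0],
})
rows, quality = quality_aware_query(df, df["suburb"] == "St Lucia", ["suburb", "score"])
print(rows)
print(quality)  # half of the returned scores are missing: a warning sign
```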

3.3. Strengthening the quality-to-use nexus

The relationship between data quality, intention to use, and the effective use of data remains unexplored in the academic literature. We argue that there is a critical need for theory development and empirical testing to identify the contexts and factors that affect the effectiveness of open data use, and thus also the value derived from open data. Studies exploring these factors would provide valuable guidance for practical open data projects. While a few recent works address effective use in the Information Systems context (Burton-Jones & Grange, 2012), their focus is on the effective use of systems rather than of data. Those systems also contain data that is known to the organisation, rather than open (unfamiliar) data; current theories of effective use in the Information Systems context therefore do not explain the effective use of open data.

4. Conclusions

In this paper we have challenged the quantity of available open data, given the lack of understanding of, and even the capacity to understand, its underlying quality. We have outlined three areas in which research and development is needed to further the body of knowledge on the effective use of open data. The identified research challenges warrant cross-disciplinary teams spanning research communities including, but not limited to, information systems, computer science, statistics, social science and business, as well as the support of agencies managing open data.

References

Abedjan, Z., Golab, L., & Naumann, F. (2015). Profiling relational data: A survey. The VLDB Journal, 24(4), 557–581.
Abiteboul, S., Dong, L., Etzioni, O., Srivastava, D., Weikum, G., Stoyanovich, J., et al. (2015). The elephant in the room: Getting value from Big Data. Proceedings of the 18th International Workshop on Web and Databases.
Arocena, P. C., Glavic, B., Mecca, G., Miller, R. J., Papotti, P., & Santoro, D. (2016). Benchmarking data curation systems. IEEE Data Engineering Bulletin, 39(2), 47–62.
Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41(3), 16.
Belkin, R., & Patil, D. J. (2016). Everything we wish we'd known about building data products. (Accessed 16 February 2016). http://firstround.com/review/everything-we-wish-wed-known-about-building-data-products/
Burton-Jones, A., & Grange, C. (2012). From use to effective use: A representation theory perspective. Information Systems Research, 24(3), 632–658.
Curry, M. (2010). The value density of information. September 14. (Accessed 16 February 2016). https://mikecurr55.wordpress.com/2010/09/14/the-value-density-of-information/
DATA.GOV. (2015). Gun offenders. December 17. http://catalog.data.gov/dataset/gun-offenders
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. John Wiley & Sons.
Duus, R., & Cooray, M. (2016). The future will be built on open data – here's why. February 6. (Accessed 16 February 2016). http://theconversation.com/the-future-will-be-built-on-open-data-heres-why-52785
Ehsan, H., Sharaf, M. A., & Chrysanthis, P. K. (2016). MuVE: Efficient multi-objective view recommendation for visual data exploration. ICDE.
Elbaz, G. (2012). Data markets: The emerging data economy. September 30. (Accessed 16 February 2016). http://techcrunch.com/2012/09/30/data-markets-the-emerging-data-economy/
English, L. P. (2009). Information quality applied: Best practices for improving business information processes and systems. Wiley Publishing.
Fan, W., & Geerts, F. (2012). Foundations of data quality management. Synthesis Lectures on Data Management, 4(5), 1–217.
Friedman, T. (2013). Magic quadrant for data quality tools. Gartner Group.
ISO. (2011). ISO/TS 8000-1 Data quality. Part 1: Overview. ISO.
Jayawardene, V., Sadiq, S., & Indulska, M. (2013). The curse of dimensionality in data quality. ACIS 2013: 24th Australasian Conference on Information Systems.
Johnston, H. R., & Carrico, S. R. (1988). Developing capabilities to use information strategically. MIS Quarterly, 37–48.
Juran, J. M., Gryna, F. M., & Bingham, R. S., Jr. (1974). Quality control handbook. McGraw-Hill Book Company.
Köhler, H., Link, S., & Zhou, X. (2015). Possible and certain SQL keys. Proceedings of the VLDB Endowment, 8(11), 1118–1129.
Kruse, S., Papenbrock, T., Harmouch, H., & Naumann, F. (2016). Data anamnesis: Admitting raw data into an organization. Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, 39(2).
Lee, Y. W., Strong, D. M., Kahn, B. K., & Wang, R. Y. (2002). AIMQ: A methodology for information quality assessment. Information & Management, 40(2), 133.
Loshin, D. (2001). Enterprise knowledge management: The data quality approach. San Francisco: Morgan Kaufmann.
McGilvray, D. (2008). Executing data quality projects: Ten steps to quality data and trusted information. Elsevier.
Naumann, F., & Rolker, C. (2000). Assessment methods for information quality criteria.
O'Reilly, C. A. (1982). Variations in decision makers' use of information sources: The impact of quality and accessibility of information. Academy of Management Journal, 25(4), 756–771.
PWC. (2014). Deciding with data. Australia: PricewaterhouseCoopers. September. https://www.pwc.com.au/consulting/assets/publications/data-drive-innovation-sep14.pdf
Queensland Government. (2016). Queensland Government data. (Accessed 25 October 2016). https://data.qld.gov.au/case-studies
Redman, T. C., & Blanton, A. (1997). Data quality for the information age. Artech House.
Rekatsinas, T., Dong, X. L., Getoor, L., & Srivastava, D. (2015). Finding quality in quantity: The challenge of discovering valuable sources for integration. CIDR.
Sadiq, S., Yeganeh, N. K., & Indulska, M. (2011). 20 years of data quality research: Themes, trends and synergies. Proceedings of the Twenty-Second Australasian Database Conference, Volume 115.
Sadiq, S., Jayawardene, V., & Indulska, M. (2015). Data quality patterns. (Accessed 16 February 2016). http://dke.uqcloud.net/DataQualityPatterns/
Sadiq, S. (2013). Handbook of data quality. Springer.
Seddon, P. B. (1997). A respecification and extension of the DeLone and McLean model of IS success. Information Systems Research, 8(3), 240–253.
Silver, N. (2012). The signal and the noise: Why so many predictions fail – but some don't. Penguin.
Song, S., & Chen, L. (2011). Differential dependencies: Reasoning and discovery. ACM Transactions on Database Systems (TODS), 36(3), 16.
Stamford, Conn. (2014). Gartner says beware of the data lake fallacy. July 28. http://www.gartner.com/newsroom/id/2809117
Tittel, E. (2014). The dangers of dark data and how to minimize your exposure. September 24. (Accessed 16 February 2016). http://www.cio.com/article/2686755/data-analytics/the-dangers-of-dark-data-and-how-to-minimize-your-exposure.html
Yeganeh, N. K., Sadiq, S., & Sharaf, M. A. (2014). A framework for data quality aware query systems. Information Systems, 46, 24–44.
Zhang, R., Jayawardene, V., Indulska, M., Sadiq, S., & Zhou, X. (2014). A data driven approach for discovering data quality requirements. ICIS 2014: 35th International Conference on Information Systems.


Shazia Sadiq is a Professor in the Data and Knowledge Engineering research group at the School of Information Technology and Electrical Engineering at The University of Queensland, Australia. Her research interests include innovative solutions for Information Systems that span several areas, including business process management, governance, risk and compliance, and information quality and use. She has published over 100 peer-reviewed publications in high-ranking journals such as Information Systems Journal, VLDBJ and TKDE, as well as major conferences. Her influential works on declarative modelling of business processes are some of the highest cited works in the area and include an industry patent.

Marta Indulska is an Associate Professor and Leader of the Business Information Systems discipline at the UQ Business School, The University of Queensland, Australia. Marta's main research interests include conceptual modelling, business process management and open innovation. She has published over 80 fully refereed articles in internationally recognized journals and conferences, including MIS Quarterly, Decision Support Systems, European Journal of Information Systems and Journal of the Association for Information Systems. Her research has been funded through several competitive grants. Marta has also worked with several organizations in the retail, consulting and non-profit sectors to provide guidance on Information Technology related topics.