Design a Data Warehouse Schema from Document-Oriented database


Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 159 (2019) 221–230

www.elsevier.com/locate/procedia

23rd International Conference on Knowledge-Based and Intelligent Information & Engineering Systems

Senda Bouaziz (a,∗), Ahlem Nabli (b), Faiez Gargouri (c)

a MIRACL Laboratory, Faculty of Economics and Management Sfax, BP 3018, Tunisia
b MIRACL Laboratory, Faculty of Computer Sciences and Information Technologies, Al-Baha University, KSA
c MIRACL Laboratory, Institute of Computer Science and Multimedia Sfax, BP 3021, Tunisia

Abstract

Traditional data warehouses are unable to meet the growing needs of the modern enterprise to integrate and analyze a wide variety of data generated by social, mobile and sensor sources. What is remarkable is that many companies have moved their data storage to NoSQL databases. Given the important role played by this type of database, it is necessary to study NoSQL databases as a data source for the modeling and the implementation of data warehouses. This paper proposes a method to design the data warehouse schema from schema-free databases, known as NoSQL databases. Our proposal starts with the extraction of schemas from document-oriented databases as an example of NoSQL databases. This extraction is performed based on the MapReduce paradigm. Then, it defines the structure identification graph from the extracted schemas. This graph will help the designer to identify the multidimensional concepts for each extracted schema in order to design the global data warehouse schema.
© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of KES International.

Keywords: NoSQL databases; Map-Reduce; Data Warehouse; Schema Design; Document-oriented database; Extraction.

1. Introduction

Due to the increasing amount of social media data and the Internet of Things, their volumes and levels of detail, we are witnessing an explosion of stored and circulating data on the web. This has produced a huge flow of data in various formats. These data can be structured, unstructured or semi-structured. This movement gave birth to the Big Data phenomenon. These data are defined as collections of data so large and complex that it has become almost impossible to store and process them using traditional database management tools.

Companies are increasingly interested in unstructured data access and its integration with structured data. Unstructured data is a significant challenge for scientists because it often takes a lot of time to structure and prepare

∗ Corresponding author. Tel.: +216 96101878.
E-mail address: [email protected]

ISSN 1877-0509. DOI: 10.1016/j.procs.2019.09.177


the data for analysis [10]. Integrating data stored in structured and unstructured formats can add significant value to an organization. Furthermore, analyzing and processing unstructured data, formatting it and merging it with traditional structured data provides decision makers with better insight into the business [7]. Relational databases, which have been the standard support for storing data for many decades, are therefore no longer suited to the Big Data phenomenon. It is in this context that we are witnessing the appearance of Not-only-SQL (NoSQL) systems, which present an alternative to relational databases. This new generation of databases can handle not only large amounts of data, but also the variety and the velocity of data. These databases are used by some of the largest organizations in the world, such as Google and Amazon. Most NoSQL databases are schema-free or at least have very relaxed schemas; there is often no need to define any sort of schema for the data.

To give organizations a 360-degree view of business direction and customer behavior, it is desirable to keep this unstructured data in the data warehouse, which is the best place to store such data. Several problems would arise if the unstructured data were stored elsewhere than in the data warehouse. The structure of NoSQL is different from that of the RDBMS. However, most of the work on creating and designing a data warehouse deals only with RDBMS-based approaches. In the field of emerging NoSQL databases, few works have addressed how to build a NoSQL data warehouse. Not only is the NoSQL structure different from that of the RDBMS, but NoSQL stores also hold larger volumes of data than the RDBMS in practice. This paper deals with the design of the data warehouse from NoSQL databases in order to propose a solution for the integration of unstructured data in the data warehouse.

This article is structured as follows. In Section 2, we review works from the literature that deal with the use of unstructured data sources. In Section 3, we explain the different steps of our approach for designing a data warehouse schema from NoSQL databases. In Section 4, we present the document-oriented database model. We describe the phase of schema extraction from the document-oriented database in Section 5. In Section 6, we present the phase of the structure identification graph. In Section 7, we present the multidimensional concepts and the modeling of the data warehouse schema. Finally, we conclude our paper in Section 8.

2. Related Work

The importance of unstructured data for the decision-making process has been highlighted in many studies [1], [10]. According to these authors, a small percentage of structured corporate data is stored in traditional data warehouses, while the vast majority is unstructured and is recorded in emails, memos, notes, data centers, online social networking calls, Internet forums and discussion forums. Also, according to the Unitas white paper, 80% of an organization's data consists of semi-structured and unstructured data [6].

The authors in [13] proposed an extraction, transformation and loading (ETL) strategy to integrate unstructured data into the data warehouse. The proposed approach is based on the MapReduce programming model for ETL flow parallelization and on Hadoop. Unstructured data is extracted using a MapReduce program and stored in the Hadoop distributed file system, where it is processed, analyzed, and transformed by applying certain filters. Finally, the processed data is loaded into the organization's data warehouse.

Following the strategy proposed by [13], the author in [11] proposed a monograph that addresses the need to store unstructured data in the data warehouse and the problems associated with storing this data there. This author also proposed an effective strategy for integrating this unstructured data into the organization's internal data warehouse. In this monograph, the author examined the factors that drive and motivate organizations to integrate unstructured data into their internal data warehouse.

Although most NoSQL databases are schema-free, information about the structural properties of persistent data is essential when developing applications. Otherwise, access to the data simply becomes impracticable. The authors in [8] presented a schema extraction algorithm operating outside the NoSQL data store. This method specifically targets semi-structured data stored in NoSQL stores, for example, in JSON format. Rather than designing the schema from the start, retrieving a schema after the fact can be considered a reverse engineering step. Based on the extracted schema information, the authors proposed a set of similarity measures that capture the degree of heterogeneity of the JSON data and reveal structural outliers in the data. However, the approach in [8] can only be applied to a single collection of documents.


Furthermore, the authors in [3] have treated the process of extracting a schema from a document-oriented database. For this, they advocated a ”hot” approach, which consists of building the schema while the database is in use. This schema extraction process starts from an empty NoSQL database; the schema is recalculated each time the user submits an insert request. These authors proposed to extract the schema from a document-oriented database to allow users (including decision-makers) to express their queries, but not for the design of the data warehouse.

Tools and techniques to process unstructured data are now being developed. In [12], the authors improved the ETL process on unstructured data in data warehousing using the MapReduce paradigm. The extraction phase handles the cleaning and profiling of the data. Then, the transformation phase is performed with MapReduce so that the data is partitioned and processed quickly and in parallel. The result of the transformation phase is stored in HDFS. For further processing, Pig Latin is used to query the data. The output is loaded into the data warehouse for business analysis.

The absence of a single schema adds to the complexity of analytic applications, in which a single analysis often involves large sets of data with different schemas. In [4], the authors proposed a technique, called schema profiling, to explain the schema variants of a collection in document-oriented databases by capturing the hidden rules that govern the use of these variants. Also, in [5], the authors proposed an original OLAP approach to collections stored in document-oriented databases. The basic idea is to stop struggling against the variety of schemas and to welcome it as an inherent source of information richness in schema-free sources. This approach is based on four steps: schema extraction, schema integration, functional dependencies enrichment, and querying. The authors in [4], [5] used document-oriented databases as data sources to address the heterogeneity of data structures that are stored within a single collection.

In our daily use, about 80-90% of the data is unstructured data and cannot be processed directly. Note that only a few works deal with the creation of a data warehouse from unstructured data sources, which implies the need for more investigation in this area. It is also remarkable that, when using unstructured sources (for example, NoSQL databases), the majority of the works did not handle the schema extraction task to define the data warehouse schema.

3. Approach of a data warehouse schema modeling from NoSQL data sources: The main steps

We propose in this paper an approach to design a data warehouse schema from NoSQL databases. As depicted in Fig.1, our approach is composed of five main steps:

• Step-1: Choose the NoSQL database. NoSQL databases can be classified into two types: aggregate-oriented databases (document-oriented, column-oriented, or key/value) and graph-oriented databases. It is important to choose the type of NoSQL database on which the extraction will be done.
• Step-2: Extract the schema. NoSQL databases are schema-free for heterogeneous data storage. More precisely, each data record can have a structure different from that of another record in the database. In this step, it is necessary to extract the schemas from each record.
• Step-3: Identify the Structure Graph (SG). In this step, we define the structure graph of each schema extracted in Step-2 ”Extract the schema”. This step helps the designer to identify the multidimensional concepts when modeling the data warehouse schema.
• Step-4: Identify the Multidimensional Concepts (MC). This step is performed by the designer, who identifies the multidimensional concepts from the structure graph.
• Step-5: Design the Data Warehouse Schema. This final step can be automated. It ensures the modeling of the data warehouse based on the identified multidimensional concepts.

Recall that the aggregate-oriented NoSQL databases consist of three types: document-oriented databases, column-oriented databases and key/value databases. In this paper, due to the page limitation, we apply our approach to one type of NoSQL database, the document-oriented database, as an unstructured source from which to define our data warehouse schema.


Fig. 1. Steps for designing a data warehouse schema from NoSQL databases.

In the following, we start by recalling the structure of document-oriented databases. After that, we apply and detail the different steps presented in Fig.1 to two selected databases.

4. Document-oriented database Model

The document-oriented database model is based on the key/value paradigm. The unit of information stored here is the document, which encodes and encapsulates the data in a semi-structured format (such as XML or JSON). This type of database considers the document as a whole instead of dividing it into several key/value pairs. Documents of different structures can be stored in the same collection. Document-oriented databases support indexes, not only on the primary identifiers, but also on the properties of the documents. The most popular implementations are CouchDB from Apache, RavenDB and MongoDB. A document-oriented database can be queried on all components of the defined schema, while a key/value database can only be queried on its key.

The document-oriented database consists of a collection of documents. A document is composed of fields and associated values (cf. Fig.2). A document is considered a hierarchy of elements that can be either atomic values or composite values (multiple atomic values or embedded documents).

Fig. 2. Structure of a document-oriented database

Like any NoSQL database, a document-oriented database is called ”schema-free”. In other words, it is not necessary to define a priori all the fields that describe a document. Thus, the documents in a collection can be heterogeneous. The big advantage of this type of database is its read performance. Indeed, it is able to retrieve, via a single key, a set of hierarchically structured information.
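To make the model concrete, the following minimal sketch represents a collection as a list of Python dictionaries standing in for JSON documents. The field names and values are illustrative assumptions and are not taken from the DBLP or LINKEDIN databases; the `classify` and `describe` helpers are hypothetical names introduced only for this illustration.

```python
from typing import Any

# A hypothetical collection: two documents with different structures stored
# side by side, which a schema-free model allows.
collection: list[dict[str, Any]] = [
    {"_id": 1, "title": "Paper A", "pages": {"start": 10, "end": 20}},
    {"_id": 2, "title": "Paper B", "publisher": "ACME", "authors": ["X", "Y"]},
]


def classify(value: Any) -> str:
    """Label a field value as atomic or composite, following the text above."""
    if isinstance(value, dict):
        return "composite (embedded document)"
    if isinstance(value, list):
        return "composite (multiple values)"
    return "atomic"


def describe(document: dict[str, Any]) -> dict[str, str]:
    """Map each top-level field of a document to its kind of value."""
    return {name: classify(value) for name, value in document.items()}


kinds = [describe(document) for document in collection]
```

The two documents share only `_id` and `title`, which is the kind of heterogeneity that makes a schema extraction step necessary before any warehouse design.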


As examples of document-oriented databases, we use the well-known DBLP and LINKEDIN databases. These databases are used in our work as real document-oriented databases.

5. Schema extraction from the document-oriented database

The schema of a database is an essential knowledge element for data manipulation. Indeed, knowledge of the database schema proves necessary, even indispensable, for identifying the multidimensional concepts in order to create the data warehouse. To efficiently extract the schemas from a document-oriented database, analyze the huge amount of input data and speed up the formulation of the data warehouse, we use the MapReduce paradigm [9].

MapReduce is a functional programming model for processing large sets of data with a parallel, distributed algorithm on a cluster. It is available as a programming framework in Hadoop, where it can easily be used to handle very large datasets. MapReduce has gained popularity for its ease of use, efficiency and ability to handle ”Big Data” in a timely manner. In this model, the programmer specifies two-step processing using a Map() function and a Reduce() function. The MapReduce system runs on a cluster-type platform and automatically parallelizes the process by cutting it into sub-processes, each of which is assigned to a node (a Map function running on a cluster machine); the partial results are then submitted to the reducers (Reduce functions running on cluster machines) to produce the final result [2].

The document-oriented model consists of three components: Collection, Document and Attributes. Therefore, at the schema extraction step, we must take these three components into consideration. To illustrate the principle of the proposed approach, an example of schema extraction from a document-oriented NoSQL database in the MapReduce model is shown in Fig.3.

Fig. 3. Extracting Schema in a MapReduce Model from a NoSQL Document-Oriented Database

The use of the MapReduce paradigm differs according to the type of database that will be used. In this step, for the document-oriented database, we define the two functions ”Map” and ”Reduce” as follows:

• Map(): takes as input a set of ”Key, Values” pairs from a collection and returns an intermediate list of ”Key, Null” pairs.
  Map(key, value) → list(key, null)
• Reduce(): takes as input an intermediate list of ”Key, Null” pairs and outputs a set of keys ”Key 1, Key 2, Key 3, ..., Key n”.
  Reduce(list(key), null) → key 1, key 2, key 3, ..., key n
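The two signatures above can be sketched in plain Python. This is a single-process illustration of the idea only, not the authors' Hadoop implementation: the function names `map_document` and `reduce_keys` and the sample documents are assumptions made for the example, and no distribution across cluster nodes is attempted.

```python
from itertools import chain
from typing import Any, Iterable

Document = dict[str, Any]


def map_document(document: Document) -> list[tuple[str, None]]:
    """Map(key, value) -> list(key, null): emit every field name, drop values."""
    return [(key, None) for key in document]


def reduce_keys(pairs: Iterable[tuple[str, None]]) -> list[str]:
    """Reduce(list(key), null) -> key 1, ..., key n: merge and deduplicate.

    Keys are returned in the order in which they were first seen.
    """
    return list(dict.fromkeys(key for key, _ in pairs))


# Hypothetical documents belonging to one collection.
documents = [
    {"_id": 1, "title": "Paper A", "pages": {"start": 10, "end": 20}},
    {"_id": 2, "title": "Paper B", "publisher": "ACME"},
]

# Map phase over every document, then a single Reduce phase.
intermediate = chain.from_iterable(map_document(d) for d in documents)
schema = reduce_keys(intermediate)
```

Emitting a null value in the Map phase reflects that only the attribute names matter for the schema; the Reduce phase then amounts to a set union over the keys.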


Algorithm 1 and Algorithm 2 are defined in order to extract all attributes from a document-oriented database in hierarchical order by using the defined MapReduce functions.

Algorithm 1: Function Extract_attributes
Input: V_CN: vector that contains the names of all collections
Output: At_Map = map<C_N, Att_C>, with
    C_N: name of the current collection
    Att_C: list of attributes of C_N

foreach DataBase DB do
    Browse the whole DB
    foreach Collection C_N in V_CN do
        Map(Key, Value) → (Key, Null)
        Reduce(Key, Null) → Key
        At_Map(C_N, Reduce)
return At_Map

Algorithm 2: Function Extract_attributes_of_embedded_document
Input: At_Map = result of Function Extract_attributes
Output: Map_Emb = map<Att, L_Emb>, with
    Att: name of the embedded document
    L_Emb: list of attributes of Att

foreach Collection C_N in At_Map do
    foreach Att in At_Map do
        while Att is Embedded document do
            Map_Emb(Att, Extract_Functions(Att))
return Map_Emb
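One possible reading of the two algorithms is sketched below in Python. It is an interpretation under stated assumptions rather than the authors' code: the database is modeled as an in-memory dictionary from collection names to lists of documents, embedded attributes are keyed by a dotted path such as `dblp.pages` (a convention chosen here, not in the paper), and embedded documents nested inside arrays are not handled.

```python
from typing import Any

Document = dict[str, Any]
Database = dict[str, list[Document]]  # collection name -> documents


def extract_attributes(db: Database) -> dict[str, list[str]]:
    """Algorithm 1 sketch: top-level attribute names of every collection."""
    at_map: dict[str, list[str]] = {}
    for name, documents in db.items():
        mapped = [(key, None) for doc in documents for key in doc]    # Map
        at_map[name] = list(dict.fromkeys(key for key, _ in mapped))  # Reduce
    return at_map


def extract_embedded(db: Database, at_map: dict[str, list[str]]) -> dict[str, list[str]]:
    """Algorithm 2 sketch: attributes of embedded documents, at any depth."""
    map_emb: dict[str, list[str]] = {}
    for name in at_map:
        # Work list of (path, documents that may hold embedded documents).
        pending = [(name, db[name])]
        while pending:
            path, documents = pending.pop()
            children: dict[str, list[Document]] = {}
            for doc in documents:
                for key, value in doc.items():
                    if isinstance(value, dict):
                        children.setdefault(key, []).append(value)
            for key, embedded in children.items():
                child_path = f"{path}.{key}"
                map_emb[child_path] = list(
                    dict.fromkeys(k for doc in embedded for k in doc)
                )
                pending.append((child_path, embedded))
    return map_emb


# Hypothetical data shaped like the "dblp" attributes discussed in the text.
db: Database = {
    "dblp": [
        {"id": 1, "type": "Book", "title": "T1", "pages": {"start": 1, "end": 9}},
        {"id": 2, "type": "Book", "title": "T2", "publisher": "P", "isbn": "X"},
    ]
}
at_map = extract_attributes(db)
map_emb = extract_embedded(db, at_map)
```

The work list in `extract_embedded` plays the role of the `while Att is Embedded document` loop: each embedded document found is itself scanned for further embedded documents until none remain.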

At this level, we apply the MapReduce paradigm to extract the different attributes of each collection. We extract all the attributes of each document in the specified collection using the Map function. Then, we merge all the attributes and eliminate all redundancies using the ”Reduce” function. The proposed algorithms are applied to two document-oriented NoSQL databases named ”DBLP” and ”LINKEDIN”. The database ”DBLP” contains a single collection named ”dblp”. Fig.4 shows an excerpt of the schema extraction of the ”DBLP” database with the MapReduce functions. The database ”LINKEDIN” contains more than one collection. We take as an example the schema extraction of the two collections ”Multiple Facebook” and ”Multiple Blogger”. Fig.5 presents an excerpt of the schema extraction of these two collections.

6. Identify the Structure Graph

After schema extraction (Step-2) with the MapReduce paradigm, we propose to build the Structure Graph (Step-3) to model all the attributes extracted from the database as a graph. Based on the work of [8], we model and store all schemas of all collections in a Structure Graph (SG). A Structure Graph summarizes the attributes of all the outputs of the ”Schema Extraction” step. In the following, we define our internal graph structure for schema extraction, with node and edge labels that capture the complete structure of the schema.

Structure Identification Graph from the schema extraction “dblp”

First, we build the SG from the attributes extracted from the document-oriented NoSQL database. The main node of


Fig. 4. Example of schema extraction from a DBLP database

Fig. 5. Example of schema extraction of the ”Multiple Facebook” and ”Multiple Blogger” collections

this graph is the node of the collection. The child nodes of this graph are the attributes that are extracted from all the different documents in the collection. The database ”DBLP” contains a single collection ”dblp”, which represents the main node. The node numbers specify the order in which the SG is built. For example, Fig.4 contains the following attributes: ”id”, ”type”, ”title”, ”pages”, ”publisher”, ”series” and ”isbn”. The attribute ”pages” contains two sub-attributes, ”start” and ”end”. These attributes are subsequently modeled as a structure graph. Fig.6 shows the structure identification graph of Fig.4 presented previously. The final Structure Identification Graph of the global DBLP database is shown in Fig.7.
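As a rough illustration of this construction, the sketch below builds a numbered tree for one collection. The `Node` class, the `build_structure_graph` function and the depth-first numbering are assumptions made for the example; the paper states only that node numbers record the order of construction, so the exact numbering in Fig.6 may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    number: int   # order in which the node was created
    label: str    # collection or attribute name
    children: list["Node"] = field(default_factory=list)


def build_structure_graph(collection: str, documents: list[dict[str, Any]]) -> Node:
    """Build one tree per collection: root = collection, children = attributes."""
    counter = 0

    def new_node(label: str) -> Node:
        nonlocal counter
        counter += 1
        return Node(counter, label)

    def attach(parent: Node, document: dict[str, Any]) -> None:
        for key, value in document.items():
            # Reuse the child if an earlier document already produced it.
            child = next((c for c in parent.children if c.label == key), None)
            if child is None:
                child = new_node(key)
                parent.children.append(child)
            if isinstance(value, dict):  # embedded document -> sub-attributes
                attach(child, value)

    root = new_node(collection)
    for document in documents:
        attach(root, document)
    return root


# Attribute names follow the DBLP example in the text; the values are made up.
documents = [
    {"id": 1, "type": "Book", "title": "T", "pages": {"start": 1, "end": 9},
     "publisher": "P", "series": "S", "isbn": "X"},
]
graph = build_structure_graph("dblp", documents)
```

Reusing an existing child when a later document repeats an attribute is what keeps the graph a summary of the collection rather than a copy of every document.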


Fig. 6. Structure Identification Graph of the output in Fig.4.

Fig. 7. Structure Identification Graph of the document-oriented DBLP Database

Structure Identification Graph from the schema extraction “LinkedIN”

To build the structure identification graph of a document-oriented NoSQL database that contains more than one collection, we first represent the structure graph of each collection separately and then merge all the structure graphs to obtain the final structure graph of the chosen database. For example, Fig.8 shows the structure identification graphs of Fig.5 presented previously. Fig.8.(a) represents the structure identification graph of the ”Multiple Facebook” collection and Fig.8.(b) represents the structure identification graph of the ”Multiple Blogger” collection.

Fig. 8. Example of the Structure Identification Graphs of the ”Multiple Facebook” and ”Multiple Blogger” collections.


After building the two structure graphs of the ”Multiple Facebook” and ”Multiple Blogger” collections, we merge these two structures to build the final structure graph (cf. Fig.9). The merge brings together all the nodes of the two structures. If a node exists in both structures, it is merged into a single node that takes the information from both nodes, together with the document key and the node number.
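The merge rule can be sketched as a recursive union of two graphs. In this hedged example each graph is a nested dictionary whose nodes record the collections they come from; the `from_schema` and `merge` helpers and the attribute names are hypothetical (they are not the attributes shown in Fig.5), and node numbers are left out to keep the sketch short.

```python
from typing import Any

# A structure graph is sketched as a nested mapping:
# attribute name -> {"sources": set of collection names, "children": same shape}
Graph = dict[str, dict[str, Any]]


def from_schema(collection: str, schema: dict[str, Any]) -> Graph:
    """Turn a nested attribute schema into a graph tagged with its collection."""
    return {
        name: {"sources": {collection}, "children": from_schema(collection, sub)}
        for name, sub in schema.items()
    }


def merge(left: Graph, right: Graph) -> Graph:
    """Union of two graphs; a node present in both becomes a single node."""
    merged: Graph = {}
    for name in list(left) + [n for n in right if n not in left]:
        a, b = left.get(name), right.get(name)
        if a is not None and b is not None:
            merged[name] = {
                "sources": a["sources"] | b["sources"],
                "children": merge(a["children"], b["children"]),
            }
        else:
            node = a if a is not None else b
            merged[name] = {
                "sources": set(node["sources"]),
                "children": merge(node["children"], {}),  # deep copy
            }
    return merged


# Hypothetical attribute names for the two collections.
facebook = from_schema("Multiple Facebook", {"id": {}, "name": {}, "location": {"city": {}}})
blogger = from_schema("Multiple Blogger", {"id": {}, "url": {}, "location": {"country": {}}})
global_graph = merge(facebook, blogger)
```

Keeping the set of source collections on each merged node is one way to preserve "the information from both nodes"; a node that appears in only one collection keeps a single source.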

Fig. 9. Global Structure Identification Graph of the two collections ”Multiple Facebook” and ”Multiple Blogger”

7. Identification of Multidimensional Concepts and Schema modeling

The final phase of our approach is the design of data warehouse schemas. This step consists in identifying the multidimensional concepts from the structure graph and then modeling the data warehouse schema. That is, Step-4 and Step-5 of our process are merged, since the designer cannot create the data warehouse without identifying the different multidimensional concepts (Fact, Measure, Dimension and Hierarchy). The Structure Graph helps the designer to identify the multidimensional concepts. An overview of the initial DW schemas is shown in Fig.10 and Fig.11. The data warehouse schema of the DBLP source includes a single DBLP fact. It is linked to Authors, Publisher, Publication and Date as dimensions.

Fig. 10. Overview of the data warehouse schema DBLP
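The multidimensional concepts named above can be captured in a small data structure. The sketch below is only an illustration: the fact and dimension names come from the text, whereas the `publication_count` measure, the hierarchy levels of the Date dimension and the class and function names are assumptions, since the paper does not list them.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    name: str
    hierarchy: list[str] = field(default_factory=list)  # finest to coarsest level


@dataclass
class Fact:
    name: str
    measures: list[str]
    dimensions: list[Dimension]


# Fact and dimension names come from the text; the measure and the
# hierarchy levels are illustrative assumptions.
dblp_schema = Fact(
    name="DBLP",
    measures=["publication_count"],
    dimensions=[
        Dimension("Authors"),
        Dimension("Publisher"),
        Dimension("Publication"),
        Dimension("Date", hierarchy=["day", "month", "year"]),
    ],
)


def star_links(fact: Fact) -> list[tuple[str, str]]:
    """List the fact-to-dimension links of a star schema."""
    return [(fact.name, dimension.name) for dimension in fact.dimensions]


links = star_links(dblp_schema)
```

A structure of this kind is what Step-5 would consume once the designer has labeled nodes of the structure graph as facts, measures, dimensions and hierarchy levels.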

The data warehouse schema from the LINKEDIN source (cf. Fig.11) includes a LinkedIN fact. It is linked to Location, User, Interest and Date as dimensions.

8. Conclusion

Traditional data warehouses are unable to meet the growing needs of the modern enterprise to integrate and analyze a wide variety of data stored in NoSQL databases. For that, we propose in this paper an approach that designs the


Fig. 11. Overview of the data warehouse schema LINKEDIN

data warehouse schema from document-oriented databases. Our proposed approach is composed of five main steps: choice of the NoSQL database, schema extraction, structure graph identification, multidimensional concepts definition and modeling of the data warehouse schema. In the second step, we proposed two algorithms that extract all the attributes of the document-oriented database based on the Map and Reduce paradigm. We illustrated our approach on the well-known ”DBLP” and ”LINKEDIN” databases. As future work, we are going to include other types of aggregate-oriented databases, such as column-oriented and key/value databases.

References

[1] Alqarni, A.A., Pardede, E., 2012. Integration of data warehouse and unstructured business documents. 2012 15th International Conference on Network-Based Information Systems, 32–37.
[2] Bala, M., Alimazighi, Z., 2013. Modélisation de processus etl dans un modèle mapreduce, in: 7ème édition de la Conférence Maghrébine sur les Avancées des Systèmes Décisionnels (ASD’13). doi:10.13140/2.1.3624.3207.
[3] Brahim, A.A., Ferhat, R.T., Zurfluh, G., 2018. Extraction du schéma d’une BD nosql orientée documents, in: Business Intelligence & Big Data, 14ème Edition de la conference EDA, Tanger, Maroc, 4-6 octobre 2018, pp. 313–320.
[4] Gallinucci, E., Golfarelli, M., Rizzi, S., 2018a. Schema profiling of document-oriented databases. Information Systems 75, 13–25. doi:10.1016/j.is.2018.02.007.
[5] Gallinucci, E., Golfarelli, M., Rizzi, S., 2018b. Variety-aware olap of document-oriented databases, in: DOLAP, Vienna, Austria.
[6] Godika, S., 2015. When data scientists analyze unstructured data, they need to make sense of disparate data sources. URL: https://www.datamation.com/applications/big-data-9-steps-to-extract-insight-from-unstructured-data.html.
[7] Gupta, V., Rathore, N.S., 2013. Deriving business intelligence from unstructured data, in: International Journal of Information and Computation Technology, pp. 971–976.
[8] Klettke, M., Störl, U., Scherzinger, S., 2015. Schema extraction and structural outlier detection for json-based nosql data stores, in: Datenbanksysteme für Business, Technologie und Web (BTW), 16. Fachtagung des GI-Fachbereichs ”Datenbanken und Informationssysteme” (DBIS), 4.-6.3.2015 in Hamburg, Germany. Proceedings, pp. 425–444. URL: https://dl.gi.de/20.500.12116/2420.
[9] Kun Ma, a.A.Y.S., 2016. Intelligent web data management of nosql data warehouse, in: Springer, C. (Ed.), Intelligent Web Data Management: Software Architectures and Emerging Technologies, pp. 21–43.
[10] Orobor, I.A., 2016. Integration and analysis of unstructured data for decision making: Text analytics approach. International Journal of Open Information Technologies 4, 82–88.
[11] Pasha, M., 2016. Data warehousing and the unstructured data. doi:10.13140/RG.2.2.32713.44642.
[12] Saradava, H., Patel, A., Aluvalu, R., 2016. A survey on etl strategy for unstructured data in data warehouse using big data analytics, in: Proceedings of RK University First International Conference on Research and Entrepreneurship (ICRE 2016).
[13] Saravana, P., A.M.V.S., 2014. Extract transform and load strategy for unstructured data into data warehouse using map reduce paradigm and big data analytics. International Journal of Innovative Research in Computer and Communication Engineering 02, 7456–7462. doi:10.15680/IJIRCCE.2014.0212030.
