An Introduction to Data Architecture


CHAPTER 1.1

An Introduction to Data Architecture

Data architecture is about the larger picture of data and how it fits together in a typical organization. The natural starting point for looking at that picture is all the data in the corporation. Fig. 1.1.1 depicts symbolically all the data, of every kind, found in the corporation: data generated by running transactions, e-mail, telephone conversations, data found in personal computers, metering data, office memos, contracts, safety reports, time sheets, and pay ledgers. In a word, if it is data and it is in the corporation, it is depicted by the bar shown in Fig. 1.1.1.

SUBDIVIDING DATA

There are many ways to subdivide the data shown in Fig. 1.1.1. The way shown here is only one of many ways data can be understood. One way to understand the data found in the corporation is to look at structured data and unstructured data. Fig. 1.1.2 shows this subdivision of data.

Structured data are data that are well defined. Structured data are typically repetitive: the same structure of data recurs repeatedly, and the only difference between one occurrence of data and another is in the contents of the data. As a simple example of structured data, consider the records of the sale of a good (an “SKU”) made by a retailer. Each time Walmart makes a sale, the item sold, the amount of the sale, the tax paid, and the date and location of the sale are recorded. In a day’s time, Walmart will create many records of the sale of many items. From a structural standpoint, the record of the sale of one item will be identical to the record of the sale of another item. The data are called “structured” because of the similarity of the structure of the records.

Data Architecture. https://doi.org/10.1016/B978-0-12-816916-2.00001-2 © 2019 Elsevier Inc. All rights reserved.


FIG. 1.1.2 Structured data is only a small part of corporate data.

FIG. 1.1.1 The totality of corporate data.

The high degree of structure and definition of the records makes them easy to handle inside a database management system. However, structured records are hardly the only kind of data in the corporation. In fact, structured data typically represent only a small fraction of the data found in the corporation. The other kind of data found in the corporation is called unstructured data. There has been much conjecture as to how much of the data in the corporation are structured and how much are unstructured. Estimates of the structured share run from as low as 2% to as high as 20%. The estimate really depends on the nature of the business of the corporation and on which data are included in the calculation.
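As a rough illustration of why structured records are easy to handle inside a database management system, the sketch below loads a few sale records into an in-memory SQLite table. The table name, column names, and sample values are hypothetical, chosen only to mirror the fields named in the retail example (item sold, amount, tax, date, location); they are not taken from any real retailer's system.

```python
import sqlite3

# Hypothetical schema mirroring the fields named in the text:
# item sold (SKU), amount of the sale, tax paid, date, and location.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sale (
        sku       TEXT NOT NULL,
        amount    REAL NOT NULL,
        tax       REAL NOT NULL,
        sale_date TEXT NOT NULL,
        location  TEXT NOT NULL
    )
    """
)

# Made-up sample rows: every record has the same structure,
# and only the contents differ from one occurrence to the next.
sales = [
    ("SKU-001", 10.00, 0.80, "2019-01-15", "Store 12"),
    ("SKU-002", 25.50, 2.04, "2019-01-15", "Store 12"),
    ("SKU-001", 10.00, 0.80, "2019-01-16", "Store 47"),
]
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?)", sales)

# Because the structure repeats, a single query applies to all records.
totals = conn.execute(
    "SELECT sku, COUNT(*), SUM(amount) FROM sale GROUP BY sku ORDER BY sku"
).fetchall()
print(totals)
```

The point of the sketch is only that one fixed schema and one query cover every record; nothing comparable is available for data whose shape changes from one occurrence to the next.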

REPETITIVE/NONREPETITIVE UNSTRUCTURED DATA

There are two basic kinds of unstructured data in the corporation: repetitive unstructured data and nonrepetitive unstructured data. Fig. 1.1.3 depicts the different kinds of unstructured data in the corporation.

A typical form of repetitive unstructured data in the corporation might be the data generated by an analog machine. For example, a farmer has a machine that reads the identification of railroad cars as the railroad cars pass through the farmer’s property. Trains pass through the property night and day. The electronic eye reads and records the passage of each car on the track.

Nonrepetitive unstructured data are data whose form and content do not recur from one occurrence to the next, such as e-mails. Each e-mail can be long or short. The e-mail can be in English or Spanish (or some other language). The author of the e-mail can say anything that he/she pleases. It is only a pure accident if the contents of any e-mail are identical to the contents of any other e-mail. And there are many forms of nonrepetitive unstructured data: voice recordings, contracts, customer feedback messages, and so on.

Because of their irregular form, unstructured data do not fit well with standard database management systems.

FIG. 1.1.3 Repetitive data and nonrepetitive data.

THE GREAT DIVIDE OF DATA

It is not obvious at all, but the dividing line between unstructured repetitive data and unstructured nonrepetitive data is very significant. In fact, it is so important that the division can be called the “great divide” of data. Fig. 1.1.4 shows the great divide of data.

It is hardly obvious why there should be this great divide of data. But there are some very good reasons for the divide: Repetitive data usually have very limited business value, while nonrepetitive data are rich in business value. Repetitive data can be handled one way; nonrepetitive data are handled very differently. Repetitive data can be analyzed one way, while nonrepetitive data can be analyzed in a very different manner. And so forth.

The two worlds, of repetitive data and of nonrepetitive data, are as different as chalk and cheese. Tools and techniques that work in one world simply are not applicable to the other world, and vice versa.

In many ways, the great divide of data is as profound as the continental divide. At the continental divide, snow that falls on one side of the divide ends up as water that flows to the Pacific Ocean, whereas snow that falls on the other side of the divide ends up heading for the Atlantic Ocean. Fig. 1.1.5 shows the continental divide.

FIG. 1.1.4 The great divide.

FIG. 1.1.5 The great divide.

TEXTUAL/NONTEXTUAL DATA

Unstructured nonrepetitive data can be further subdivided into textual and nontextual data. Fig. 1.1.6 shows this further subdivision of data.

Textual data are data embodied in the form of text. An obvious example is e-mail or contract data. An e-mail is nothing but text, and a contract is nothing but text. Nonrepetitive nontextual data might be the picture an insurance adjuster takes of a car after it has been in an accident, or the videotape a real estate agent makes of a house that is for sale.

FIG. 1.1.6 Textual and nontextual nonrepetitive data.


THE DIFFERENT FORMS OF DATA

The basic divisions of data shown in Fig. 1.1.6 are important for a lot of reasons. Each division of data requires its own infrastructure, its own technology, and its own treatment. Even though all forms of data exist in the same corporation, each form may as well exist on a different planet.
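One way to picture these divisions is as a small decision tree: structured or not, then repetitive or not, then textual or not. The following is a minimal sketch only. The `DataForm` enum and the `classify` function are hypothetical names introduced for illustration, and the examples are the ones used in this chapter, tagged by hand rather than detected automatically.

```python
from enum import Enum


class DataForm(Enum):
    """The four forms of data distinguished in this chapter (hypothetical names)."""
    STRUCTURED = "structured"
    UNSTRUCTURED_REPETITIVE = "unstructured repetitive"
    NONREPETITIVE_TEXTUAL = "unstructured nonrepetitive textual"
    NONREPETITIVE_NONTEXTUAL = "unstructured nonrepetitive nontextual"


def classify(structured: bool, repetitive: bool = False, textual: bool = False) -> DataForm:
    """Walk the divisions in order: structured, then repetitive, then textual."""
    if structured:
        return DataForm.STRUCTURED
    if repetitive:
        return DataForm.UNSTRUCTURED_REPETITIVE
    if textual:
        return DataForm.NONREPETITIVE_TEXTUAL
    return DataForm.NONREPETITIVE_NONTEXTUAL


# Examples from the chapter, with their properties assigned by hand.
examples = {
    "retail sale record": classify(structured=True),
    "railroad car reading": classify(structured=False, repetitive=True),
    "e-mail": classify(structured=False, textual=True),
    "contract": classify(structured=False, textual=True),
    "accident photograph": classify(structured=False),
}

for name, form in examples.items():
    print(f"{name}: {form.value}")
```

In practice, a label of this kind would decide which infrastructure and treatment a given piece of data is routed to.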

BUSINESS VALUE

There are, then, many reasons for the different treatment of the different forms of data. But perhaps the most salient reason is the relationship of each form to business value. Fig. 1.1.7 shows that this relationship differs greatly across the different forms of data.

Fig. 1.1.7 shows that structured data carry a very high degree of business value. As an example, it is really important, both to the bank and to the customer, to have the correct bank account balance. Textual data contain even more highly valued business data. When customers talk to an agent of the company through a call center, everything the customer says is valuable. There is significantly less business value in nonrepetitive nontextual data and in unstructured repetitive data.

FIG. 1.1.7 Business value varies dramatically across different types of data.
