The Life Cycle of Data – Understanding Data Over Time


1.6 THE LIFE CYCLE OF DATA – UNDERSTANDING DATA OVER TIME

Data in the corporation has a predictable life cycle. The life cycle applies to most data, although a few kinds of data do not follow the pattern described here. The life cycle of data is shown in Figure 1.6.1 and Figure 1.6.2.

Figure 1.6.1

Figure 1.6.2

The life cycle begins when raw data enters the corporate information systems. Raw data can enter in many ways. A customer may execute a transaction, and the data is captured as a by-product of the transaction. An analog computer may make a reading, and the data is entered as part of the analog processing. A customer may initiate an activity (such as making a phone call), and a computer captures that information. In general, the data that enters the information systems of the corporation is at the most detailed level.

After the raw detailed data has entered the system, it passes through a capture/edit process. In this basic edit process, the raw detailed data can be adjusted or even rejected.

After the capture/edit process, the raw detailed data goes through an organization process. The organization process can be as simple as indexing the data, or the raw detailed data may be subjected to an elaborate filtering/calculation/merging process. At this point the raw detailed data is like putty that the system designer can shape in many ways.

Once the raw detailed data has passed through the organization process, it is fit to be stored. The data can be stored in a standard database management system (DBMS), in Big Data, or in other forms of storage.

After the data is stored and before it is fit for analysis, it typically passes through an integration process. The purpose of the integration process is to restructure the data so that it can be combined with other types of data. It is at this point that the data enters the cycle of usefulness, which is discussed at length later. After the data has fulfilled its usefulness, it can be either archived or discarded.

The life cycle described so far is for raw detailed data. Summarized or aggregated data has a slightly different life cycle, shown in Figure 1.6.3. The life cycle of most summarized or aggregated data begins the same way as that of raw detailed data: raw data is ingested into the corporation. But once that raw data becomes part of the infrastructure, it is accessed, categorized, and calculated. The calculation is then saved as part of the information infrastructure, as shown in Figure 1.6.3.

Once raw and summarized data become part of the information infrastructure, the data is subject to the “curve of usefulness.” The curve of usefulness states that the longer data remains in the infrastructure, the less likely it is that the data will be used in analysis. Figure 1.6.4 illustrates that, when looked at from the standpoint of age, the fresher data is, the greater the chance that it will be accessed. This phenomenon applies to most types of data found in the corporate information infrastructure. As data ages, the probability of access drops, and the older data becomes, for all practical purposes, “dormant.” The phenomenon of data becoming dormant is not quite as true for structured online data.
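The capture/edit, organization, and summarization steps described earlier can be illustrated with a small sketch. This is only one possible reading of those steps: the record fields, the rule that rejects negative amounts, and the choice to index by customer are all assumptions made for illustration, not something the text prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RawRecord:
    """One raw detailed record; the field names are hypothetical."""
    customer_id: str
    category: str
    amount: float


def capture_edit(records):
    """Capture/edit step: adjust records where possible, reject the rest."""
    accepted = []
    for rec in records:
        if rec.amount < 0:  # assumed rule: negative amounts are rejected
            continue
        # assumed adjustment: normalize the spelling of the category
        cleaned = rec.category.strip().lower()
        accepted.append(RawRecord(rec.customer_id, cleaned, rec.amount))
    return accepted


def organize(records):
    """Organization step, here as simple as indexing the data by customer."""
    index = defaultdict(list)
    for rec in records:
        index[rec.customer_id].append(rec)
    return dict(index)


def summarize(records):
    """Summary life cycle: access, categorize, and calculate a total per category."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.category] += rec.amount
    return dict(totals)


raw = [
    RawRecord("c1", " Phone ", 12.5),
    RawRecord("c2", "phone", 7.5),
    RawRecord("c1", "Data", -3.0),  # rejected by the edit step
    RawRecord("c2", "data", 20.0),
]

detailed = capture_edit(raw)       # raw detailed data after capture/edit
by_customer = organize(detailed)   # organized (indexed) detailed data
summary = summarize(detailed)      # calculation saved as summary data
print(summary)  # {'phone': 20.0, 'data': 20.0}
```

In this sketch the detailed records and the summary are kept as separate results, which mirrors the point that summarized data is calculated from raw data and then saved alongside it in the infrastructure.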

Figure 1.6.3




Figure 1.6.4

There are certain types of businesses where the phenomenon of data aging does not hold true. One such industry is life insurance, where actuaries regularly look at data that is more than 100 years old. And in certain scientific and manufacturing research organizations, there may be great interest in results that were generated more than 50 years ago. But most organizations do not have an actuary or a scientific research facility. For those more ordinary organizations, the focus is almost always on the most current data.

The decline in usefulness can be expressed as a curve, as shown in Figure 1.6.5. The declining curve of usefulness states that over time the value of data decreases, at least insofar as the probability of access is concerned. Note that the value never actually reaches zero. But after a while, the value comes very close to zero.
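One way to picture a curve that keeps declining but never reaches zero is a simple exponential decay. The sketch below is an assumption made purely for illustration: the text does not give a formula, and the function name, the starting probability, and the half-life are all hypothetical.

```python
import math


def access_probability(age_years, initial=0.9, half_life_years=1.0):
    """Assumed model: the probability of access halves every `half_life_years`."""
    decay_rate = math.log(2) / half_life_years
    return initial * math.exp(-decay_rate * age_years)


# The probability shrinks quickly with age but stays above zero.
for age in (0, 1, 2, 5, 10):
    print(f"age {age:>2} years: {access_probability(age):.4f}")
```

Under these assumed parameters, ten-year-old data has a probability of access below one in a thousand, which is the sense in which old data is dormant for all practical purposes even though its value is not strictly zero.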

Figure 1.6.5


Figure 1.6.6

Figure 1.6.7

At some point in time, the value is so low that for all practical purposes it might as well be zero. The curve is rather sharp, a classical Poisson distribution.

An interesting aspect of the curve is that it differs for summary and detailed data. Figure 1.6.6 shows that the declining curve of usefulness is much steeper for detailed data than it is for summary data. Furthermore, over time the usefulness of summary data goes flat but does not approach zero, whereas the curve for detailed data does indeed approach zero. In some cases the curve for summarized data actually starts to grow over time, although only very slowly.

There is another way to look at the dormancy of data over time. Consider the curve that expresses the accumulation of data over time. This curve, shown in Figure 1.6.7, shows that the volume of data accumulating in the corporation accelerates over time. This phenomenon holds for nearly every organization.

Another way to look at this accumulation curve is shown in Figure 1.6.8: as data accumulates over time in the corporation, there are different and dynamic bands of data usage. One band contains data that is heavily used. Another band contains lightly used data. And yet another band contains data that is not used at all. As time passes, these bands expand. Usually the bands relate to the age of the data. The younger the data is, the more relevant it is to the current business of the corporation, and the more it is accessed and analyzed.

Figure 1.6.8

When looking at data over time, another interesting phenomenon occurs: over long periods of time, the integrity of data degrades. Perhaps the term “degrades” is not appropriate because it carries a pejorative sense. No such connotation is intended here. Instead, as used here, the term simply means that there is a natural and normal decay of the meaning of data over time. Figure 1.6.9 shows the degradation of integrity of data over time.

Figure 1.6.9

To understand the degradation of integrity over time, let’s look at some examples. Consider the price of meat, say hamburger, over time. In 1850 hamburger was $0.05 per pound. In 1950 the price of hamburger was $0.95 per pound. And in 2015 the price of hamburger is $2.75 per pound. Does this comparison of the price of hamburger over time make sense? The answer is: sort of. The problem is not in the measurement of the price of hamburger. The problem is in the currency by which hamburger is measured. The very meaning of a dollar in 1850 is different from the meaning of a dollar in 2015.

Now consider another example. The price of one share of IBM stock was $35 in 1950, and the price of that same share in 2015 is $200. Is the comparison of a stock price over time a valid comparison? The answer is: sort of. IBM in 2015 is not the same company it was in 1950 in terms of products, customers, and revenues, nor is the dollar worth the same. In a hundred ways, there simply is no comparison between IBM in 1950 and IBM in 2015. Over time the very definition of the data has changed. So while a comparison of IBM’s stock price in 1950 with the stock price in 2015 is interesting, it is a completely relative number, because the very meaning of the number has drastically changed.

Given enough time, the very definition of values and data changes. That is why degradation of the definition of data is simply a fact of life.
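The hamburger example can be made concrete by restating each nominal price in the dollars of a single reference year. The sketch below assumes a price index is available. The index values shown are hypothetical placeholders, not real statistics, and the function name is likewise invented for illustration.

```python
def to_reference_dollars(nominal_price, index_then, index_reference):
    """Restate a nominal price in reference-year dollars using a price index."""
    return nominal_price * index_reference / index_then


# Nominal hamburger prices per pound, as quoted in the text.
nominal = {1850: 0.05, 1950: 0.95, 2015: 2.75}

# HYPOTHETICAL price index values (2015 = 100); placeholders, not real data.
hypothetical_index = {1850: 3.0, 1950: 10.0, 2015: 100.0}

for year, price in nominal.items():
    adjusted = to_reference_dollars(
        price, hypothetical_index[year], hypothetical_index[2015]
    )
    print(f"{year}: nominal ${price:.2f} -> ${adjusted:.2f} in 2015 dollars")
```

The adjusted figures depend entirely on the index that is chosen, which is the point of the example: the number recorded in 1850 cannot be compared with the number recorded in 2015 until the change in the meaning of a dollar has been accounted for, and even then other definitions may have shifted.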