Mining time series data by a fuzzy linguistic summary system



Fuzzy Sets and Systems 112 (2000) 419–432

www.elsevier.com/locate/fss

Ding-An Chiang ∗, Louis R. Chow, Yi-Fan Wang
Department of Information Engineering, Tamkang University, Tanshui, Taipei, Taiwan
Received April 1997; received in revised form December 1997

Abstract

In this paper, we are interested in mining data that have a natural ordering according to some attribute; time series data are one example of this kind. The difficulty in mining time series data is that the quantities at different times may be very close or even equal to each other. To solve this problem, we propose a fuzzy linguistic summary as one of the data mining functions in our KDD (Knowledge Discovery in Databases) system to discover useful knowledge from the database. To help users premine a database, our system also provides a graphic display tool so that users can predetermine what knowledge could be discovered from the database. To demonstrate that our system works correctly, we use it to analyze a time series data problem, the resource usage analysis problem, to predict the utilization ranks of different resources at a specific time. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Linguistic summary; Data mining; Matching degree; Truth value

∗ Corresponding author. E-mail address: [email protected] (D. Chiang).
0165-0114/00/$ - see front matter. PII: S0165-0114(98)00003-7

1. Introduction

Database technology has been used successfully in many applications. Today, the grand challenge in using a database is to generate useful rules from its raw data so that users can make decisions, and these rules may be hidden deep in the raw data. Traditionally, turning data into knowledge relies on manual analysis and interpretation, which is becoming impractical in many domains as data volumes grow exponentially. Therefore, to solve the problem of knowledge extraction from a database, different data mining approaches and KDD systems have been proposed; readers can refer to the excellent works [5,7,11] on this subject.

In some applications, an object may only approximately satisfy a certain criterion or property. For example, it is often difficult to classify patients as fully sick [8]. Crisp data mining approaches may therefore not be appropriate for these situations. To overcome this, we propose a fuzzy linguistic summary as one of the data mining functions in our KDD system. In the proposed system, we use this summary to predict the utilization ranks of different resources, which include CPU, real storage, etc., at a specific time. Consequently, system programmers can tune MVS by moving less important jobs from peak time to off-peak time.

In this paper, the approach to the summarization of data belongs to the fuzzy linguistic approach. Using the fuzzy linguistic approach in a fuzzy relational database allows us not only to handle imprecise queries but also to summarize data easily. Several researchers have proposed a series of excellent works about fuzzy linguistic queries in a fuzzy relational database [1,2,9,10,12–14,22]. To find useful knowledge from data by summarization, Yager proposed an approach to the summarization of data based on the theory of fuzzy sets [18]. Yager's approach is useful for both numeric and non-numeric data. It summarizes data with three values: a summarizer, a quantity in agreement, and a truth value. In [8–10], a summarizer is also called a property, and a quantity in agreement is also called a quantifier. Similar to Yager's approach, our fuzzy linguistic summary is also developed from the theory of fuzzy sets. However, unlike Yager's approach, when there is no unique property for characterizing a group of objects, our method can find a disjunctive property for characterizing the group. For example, we may conclude "most northern Chinese like pasta or dumplings" but not "most northern Chinese like pasta". Therefore, the property we find in the proposed summary is a disjunctive property, $F = F_1 \vee \cdots \vee F_m$, rather than a single property, $F_i$, where the length of $F$ is denoted as $|F|$. However, the disjunctive property $F$ may be too general to characterize a group of objects, because valuable information is lost; therefore, we have to restrict the length of $F$ when it is too general.

Since it is difficult to predict exactly what could be discovered from a database, and the data mining process is interactive and iterative, it is necessary to include the human in the data mining process [5,7]. Therefore, our system supports a graphic display tool to help users premine the database.
With the results of the premining process, users can interact with the system so that the fuzzy linguistic summary can automatically determine, from the predefined properties, the property that most objects in a group have. Although the data mining process involves numerous steps, all equally important for the successful application of data mining in practice, we focus on the data mining component in this paper.

To demonstrate that the proposed fuzzy linguistic summary works correctly, we consider an on-line CPU utilization analysis problem (the CPU usage analysis problem) during working time on weekdays. Since the quantity of the on-line CPU utilization is ordered by time, the CPU utilization data are a kind of time series data. MVS (Multiple Virtual Storage) is IBM's operating system for its mainframe computers, and the SMF (System Management Facility) in MVS records important information about every task in the system. With this information, system programmers can determine how to tune the system performance. However, the SMF data are too large, about 500 MB a day, and too sophisticated to be understood by MVS users; therefore, only experienced programmers can analyze and use them. To simplify this information, we can further process the SMF data to find some predictive rules. According to these rules, system programmers can tune MVS system performance by moving unimportant jobs from peak time to leisure time. Although we consider only time series data, as shown by Example 1 in Section 2, our method can also be applied to other problems whose data are ordered by some attribute.

The remainder of the paper is organized as follows. In Section 2, the theoretical basis of fuzzy logic for the proposed summary is presented. In Section 3, the graphic display tool and the fuzzy linguistic summary of the system are presented. Section 4 gives the experimental results, where we demonstrate that our method can be applied to the on-line CPU utilization analysis problem. Conclusions and future research are drawn in the final section.

2. Preliminaries

In this section, we briefly introduce some preliminary results and definitions that are useful for later discussion.

2.1. The fuzzy relational model

To store data, we have to define our fuzzy relational model. Mathematically, a fuzzy relational schema $R(A_1, \ldots, A_n, \mu_r)$ is made up of a relation name $R$ and a list of attributes $A_1, A_2, \ldots, A_n, \mu_r$. Each attribute $A_i$ is the name of a role played by some domain, $\mathrm{Dom}(A_i)$, and $\mu_r$ is characterized by the following membership function:

$$\mu_r : \mathrm{Dom}(A_1) \times \cdots \times \mathrm{Dom}(A_n) \to [0, 1].$$

Definition 1 (Chiang et al. [6]). Let $R(A_1, \ldots, A_n, \mu_r(t))$ be a fuzzy relational schema. An n-ary fuzzy relation $r$ over $R$ is a fuzzy subset of $\mathrm{Dom}(A_1) \times \cdots \times \mathrm{Dom}(A_n)$; $t[A_i]$ refers to the value in $t$ for attribute $A_i$, and each tuple $t$ is an element of $\mathrm{Dom}(A_1) \times \cdots \times \mathrm{Dom}(A_n)$. That is,

$$r = \{(t, \mu_r(t)) \mid t = ((t[A_1], \mu_1(t[A_1])), \ldots, (t[A_n], \mu_n(t[A_n]))),\ \text{and for } i = 1, \ldots, n:\ t[A_i] \in \mathrm{Dom}(A_i),\ \mu_i(t[A_i]) \in [0, 1],\ \text{and } \mu_r(t) = \min(\mu_1(t[A_1]), \ldots, \mu_n(t[A_n]))\}.$$

2.2. A fuzzy relation and a fuzzy set

In this paper, we use a fuzzy relation EQUAL (EQ) to compute the membership degree of an object with respect to a given criterion, $\varphi(A)$. The fuzzy relation EQUAL (EQ) is defined as follows.

Definition 2 (Raju and Majumdar [14]). A fuzzy relation EQUAL (EQ) over $\mathrm{Dom}(A_j)$ is characterized by the membership function $\mu_{EQ}$, where $\mu_{EQ}$ satisfies the following conditions: for all $x, y \in \mathrm{Dom}(A_j)$, $\mu_{EQ}(x, x) = 1$ and $\mu_{EQ}(x, y) = \mu_{EQ}(y, x)$.

According to Zadeh's possibility theory [20,21], $\mu_{EQ}(x, y)$ can be interpreted as the possibility of treating the two values $(x, \mu(x))$ and $(y, \mu(y))$ equally under the same fuzzy term. In this paper, the value of $\mu_{EQ}(x, y)$ is also called the matching degree of $x$ to $y$ under the same fuzzy term. Buckles and Petry [3] have pointed out that a tuple's membership value is not a static value, but a measurement of the appropriateness of the tuple to the given criterion, $\varphi(A)$. Therefore, the membership value of a tuple should be dynamically created according to the given criterion. Since an object may only partially match a certain property (criterion) in many situations, we define a prespecified threshold value, $\alpha_j$, and require $\mu_r(t_i) \ge \alpha_j$. When $\mu_r(t_i) < \alpha_j$, we assume that the tuple $t_i$ does not satisfy the given property. For example, let $\varphi(A) = y$. Then a fuzzy relation EQUAL (EQ) over an attribute domain $\mathrm{Dom}(A_j)$ with respect to $y$ can be defined as:

$$\mu_{EQ}(x, y) = \begin{cases} 0 & \text{for } 1 - |\mu_j(x) - \mu_j(y)| < \alpha_j, \\ 1 - |\mu_j(x) - \mu_j(y)| & \text{for } 1 - |\mu_j(x) - \mu_j(y)| \ge \alpha_j, \end{cases}$$

where $\alpha_j$ is a predefined threshold value. This is not the only way of defining the fuzzy relation EQUAL (EQ) in a fuzzy relational database. For example, in the similarity-based fuzzy relational database, the EQUAL (EQ) relation can be defined as $\mu_{EQ}(x, y) = \mathrm{sim}(x, y)$, where the similarity relation $\mathrm{sim}(x, y)$ describes the degree of similarity between $x$ and $y$. Accordingly, the fuzzy set can be defined as follows.

Definition 3. Let the criterion $\varphi(A) = \varphi(A_i)$ and $\varphi(A_i) = y_i$. Then a fuzzy set $X$ with respect to the criterion $\varphi(A)$ in a fuzzy relation $r$ can be defined as:

$$X = \{(t, \mu_r(t)) \mid t \in r \text{ and } \mu_r(t) = \mu_{EQ}(t[A_i], y_i)\},$$

where $\mu_r : r \to [0, 1]$ is its membership function, and $\mu_r(t)$ is the membership degree (truth value) of tuple $t$ in $X$ with respect to the criterion $\varphi(A)$.
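The thresholded EQUAL relation and the fuzzy set of Definition 3 can be illustrated with a short Python sketch. It is only a minimal illustration, not the authors' implementation: the names `eq_degree` and `fuzzy_set`, the dictionary layout of a tuple, and the sample membership degrees are assumptions made for this example.

```python
def eq_degree(mu_x, mu_y, threshold):
    """Matching degree of x to y: 1 - |mu(x) - mu(y)|, or 0 below the threshold."""
    degree = 1.0 - abs(mu_x - mu_y)
    return degree if degree >= threshold else 0.0


def fuzzy_set(tuples, attribute, mu_criterion, threshold):
    """Pair every tuple with its matching degree to the criterion (Definition 3)."""
    return [(t, eq_degree(t[attribute], mu_criterion, threshold)) for t in tuples]


# Hypothetical tuples: each stores the membership degree of one attribute value.
sample = [{"id": 1, "mu": 0.9}, {"id": 2, "mu": 0.5}, {"id": 3, "mu": 1.0}]

# Criterion with membership degree 1.0 and an illustrative threshold of 0.75.
for t, degree in fuzzy_set(sample, "mu", 1.0, 0.75):
    print(t["id"], round(degree, 4))
```

In this sketch the second tuple falls below the threshold and therefore receives degree 0, which mirrors the assumption that such a tuple does not satisfy the property at all.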

2.3. A calculus of linguistically quantified propositions

As pointed out by Yager [19], a linguistic summary is a linguistically quantified proposition containing meta-knowledge about a set of particular objects, and it is useful in database discovery. Different approaches to the calculus of linguistically quantified propositions have been proposed. For example, Zadeh proposed a computation method to evaluate quantified statements [22]; Yager proposed competitive aggregation [16] and the OWA method for quantified statements [17]. In this section, we use Zadeh's linguistically quantified proposition because of its simplicity and intuitive appeal [8]. The general form of a linguistically quantified proposition is usually written as

$$Q\ X \text{ are } F,$$

where $Q$ is a (fuzzy) linguistic quantifier, $X$ is a set of objects, and $F$ is a property.
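As a rough illustration of how a proposition of this form is scored (Definition 4 below gives the formal statement), the following Python sketch applies a quantifier to the average membership degree of the objects. The quantifier "most", the threshold 0.75, and the membership degrees are those of Example 1 and Table 1 in this section; the function names `mu_most`, `cut`, and `truth`, and the choice of passing pre-computed degrees as plain lists, are assumptions of the sketch rather than part of the system described here.

```python
def mu_most(x):
    """Membership function of the quantifier "most" from Example 1."""
    if x >= 0.85:
        return 1.0
    if x <= 0.35:
        return 0.0
    return 2.0 * x - 0.7


def cut(degree, threshold):
    """Treat a degree below the threshold as not matching at all."""
    return degree if degree >= threshold else 0.0


def truth(quantifier, degrees):
    """Truth of "Q X are F": the quantifier applied to the average degree."""
    return quantifier(sum(degrees) / len(degrees))


THRESHOLD = 0.75
adult = [0.462, 1.0, 0.55, 1.0]  # mu_adult(Age) column of Table 1
ap25 = [0.75, 0.55, 1.0, 0.35]   # mu_ap25(Age) column of Table 1

adult_cut = [cut(d, THRESHOLD) for d in adult]
# Disjunctive property "adult or ap25": maximum of the thresholded degrees.
either = [max(cut(a, THRESHOLD), cut(b, THRESHOLD)) for a, b in zip(adult, ap25)]

print(round(truth(mu_most, adult_cut), 4))  # "most employees are adults"
print(round(truth(mu_most, either), 4))     # "... adults or approximately 25"
```

With these inputs the first proposition evaluates to 0.3 and the second to 1, matching the hand calculation in Example 1.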


For a relational database, let "$Q\{t_1, \ldots, t_n\}$ are $F$" be a linguistically quantified proposition, where $t_1, \ldots, t_n$ is a set of tuples in a fuzzy relation $r$. Then, according to the semantics of the property, the truth value of "$Q\{t_1, \ldots, t_n\}$ are $F$" over the fuzzy relation $r$ can be computed by the following definition.

Definition 4. Let $\{t_1, \ldots, t_n\}$ be a set of tuples in a group over a relation $r$. Then

$$\mathrm{Truth}(Q\{t_1, \ldots, t_n\} \text{ are } F) = \mu_Q\left(\frac{1}{n}\sum_{i=1}^{n} \mu_r(t_i)\right),$$

where $\mu_r(t_i)$ is the membership degree of $t_i$ with respect to $F$, and

$$\frac{1}{n}\sum_{i=1}^{n} \mu_r(t_i)$$

is the average matching degree of the relation $r$ with respect to $F$. When $F = F_1 \vee \cdots \vee F_m$, $\mu_r(t_i) = \max_{j=1}^{m} \mu_{F_j}(t_i)$, and when $F = F_1 \wedge \cdots \wedge F_m$, $\mu_r(t_i) = \min_{j=1}^{m} \mu_{F_j}(t_i)$. Moreover, when $F = \neg F_j$, $\mu_r(t_i) = 1 - \mu_{F_j}(t_i)$.

Example 1. Consider the fuzzy relation EMP(Name, Job, Age) in Table 1. The membership functions of "adult", "ap25", and "most" are given as follows:

$$\mu_{adult}(x) = \begin{cases} (1 + |x - 30|/6)^{-1} & \text{for } x < 30, \\ 1 & \text{for } 30 \le x, \end{cases}$$

$$\mu_{ap(25)}(x) = (1 + |x - 25|/6)^{-1},$$

$$\mu_{most}(x) = \begin{cases} 1 & \text{for } x \ge 0.85, \\ 2x - 0.7 & \text{for } 0.35 < x < 0.85, \\ 0 & \text{for } x \le 0.35, \end{cases}$$

where "ap25" stands for "approximately equal to 25". Let the predefined threshold value for the "Age" attribute be 0.75. When the property is "Age = adult", the truth value of "most employees are adults" is $\mu_{most}((0 + 1 + 0 + 1)/4) = 0.3$. Since the validity of this proposition is too small, the property "adult" cannot characterize the employees in the EMP relation. Moreover, since the truth value of "most employees are adults or approximately equal to 25 years old" is $\mu_{most}((0.75 + 1 + 1 + 1)/4) = 1$, we conclude that the

Table 1
The EMP fuzzy relation

Name     Job          Age     μ_adult(Age)   μ_ap25(Age)   μ_r(t)
Murthy   Engineer     23      0.462          0.75          0.75
Roy      Manager      adult   1              0.55          1
Kumar    Accountant   25      0.55           1             1
Alex     Manager      36      1              0.35          1

proposition "most employees are adults or approximately equal to 25 years old" is true.

3. A fuzzy linguistic summary system

Our system, which is written in Visual BASIC and Visual C, is implemented on an IBM PC-586. The system includes two major modules: the graphic display tool and the data mining agent. The simplified architecture of the system is given in Fig. 1. In the proposed system, users can premine the relational database through the graphic display tool. Since data mining is an application-oriented process, different applications may need different data mining approaches. In this paper, we introduce only the fuzzy linguistic summary in the data mining agent. As shown in Fig. 1, to avoid consuming the resources of the MVS system and to reduce the size of the MVS performance data, only the useful data are downloaded through an IBM 3270 emulation card and reformatted into a relational database, Microsoft Access. Consequently, the size of the data is reduced from approximately 500 MB per day to approximately 1 MB per day. Parts of the original data and of the reformatted data are shown in Fig. 2 and Table 2, respectively, where the CPU usage data are stored in the WCPU relation. Comparing these two kinds of data, the reformatted data can be more easily understood and used by users.

3.1. A graphic display tool

This section provides a brief introduction to the graphic display tool. The graphic display tool is developed to display the performance data of the host system. It is much more than just a system monitor. It is a performance tuning


Fig. 1. The simplified architecture of the system.

Fig. 2. The original data in the SMF.

Table 2
The reformatted data in the WCPU relation

Date     Date-type   Time    CPU-usage   Rank   μ_cpu(t[CPU])
040196   weekday     09:00   34.9        3      0.8524
040196   weekday     10:00   31.7        1      0.7033
040196   weekday     11:00   35.2        4      0.8672
040196   weekday     12:00   26.9        0      0.5064
040196   weekday     13:00   32.3        2      0.7302
040196   weekday     14:00   37.8        5      1
040196   weekday     15:00   37.8        6      1

and trend analysis tool. It provides window-based interactive dialog boxes, mouse-driven menus, and scroll bars. As shown in Fig. 3, the input screen of the system can be separated into three regions for input selection:
• Data Interval region,
• Time Interval region,
• Function Groups region.

3.1.1. Data interval region

This includes two text boxes (Start/End) and three combo boxes (Quick Setting, Group By, and Calendar


Fig. 3. The input screen of the system.

Specification). Users can use Start/End or Quick Setting to select the data they are interested in. When Quick Setting is used, a drop-down list appears after the user presses this box; it includes the data interval keywords Today, Yesterday, ..., and Last Year. In the system, the data between the specified Start/End dates can be grouped by day, month, or year. Moreover, the data between the specified Start/End dates can also be averaged (or summed) according to the days defined in the calendar named in the Calendar Specification combo box. The drop-down list displays all the calendar names defined at the Calendar Icon and includes one default name: None.

3.1.2. Time interval region

This includes four double text boxes for Start/End time, and one check box for average/non-average (or sum/non-sum). The left boxes are for the start time, and the right boxes are for the end time. Each text box also includes a button that allows quick setting. The Avg check box is disabled during any operation examining detailed data. Since the system stores the performance data once per hour, the minimum time interval supported by the system defaults to one hour.

3.1.3. Function groups region

This contains eight function tabs, each of which opens an input screen designed to examine the relevant function. To demonstrate that the graphic display works, some of these function tabs are briefly introduced as follows:
• Storage tab displays the page/swap rate diagram, real storage usage diagram, and expanded storage usage diagram. According to the displayed diagrams, experienced system programmers can evaluate the existing system parameters to effectively keep the page/swap rate at an acceptable level for real storage. For example, Fig. 4 represents the page/swap rate as a bar chart, grouped by month, from April 1996 to August 1996. Moreover, we can double-click each bar to see more detailed information.
• DASD (Direct Access Storage Device) volumes tab displays the I/O response time diagram, DASD utilization diagram, cache utilization diagram, etc. Experienced system programmers can use these diagrams to search for serviceable DASD volumes on which to reallocate files. For example, Fig. 5 shows DASD I/O response time from April 1996 to August 1996.
• TSO (Time Sharing Option) performance groups tab displays the TSO response time diagram, transactions count diagram, productivity analysis for TSO users diagram, etc. With the TSO facility, users can immediately access the computer. However, typical TSO users spend most of their time waiting for various resources. To solve this problem, experienced system programmers can determine the causes of slow response time from these diagrams. Fig. 6 is a TSO response time diagram from April 1996 to August 1996.

Fig. 4. Page/swap rate from April 1996 to August 1996.

Fig. 5. DASD I/O response time from April 1996 to August 1996.

3.2. A fuzzy linguistic summary

To concentrate on presenting the fuzzy linguistic summary function in the data mining agent of the system, we assume that all relevant attributes and data have been extracted according to the corresponding domain knowledge. The domain knowledge is useful

to focus the search for interesting patterns, and it can take a number of different forms. In the system, the corresponding domain knowledge is represented as

$$\{\mathrm{Table}.A_1, \ldots, \mathrm{Table}.A_k\} \approx B,$$

where "$\approx$" stands for "relates to", and for $i = 1, \ldots, k$, attribute $\mathrm{Table}.A_i$ relates to knowledge $B$. For example, when we want to predict the utilization ranks of the CPU at a specific time, the corresponding domain knowledge with respect to the WCPU relation in Table 2 can be represented as

$$\{\mathrm{WCPU.Time},\ \mathrm{WCPU.Date\text{-}type},\ \mathrm{WCPU.Rank},\ \mathrm{WCPU}.\mu_{cpu}(t[\mathrm{CPU}])\} \approx B.$$

The objective of our fuzzy linguistic summary is to predict the utilization ranks of different resources


Fig. 6. TSO response time diagram from April 1996 to August 1996.

at a specific time. Therefore, we design a program to assign the following ranks:

Rank-0 → off-peak time;
Rank-1 → much busier than Rank-0;
⋮
Rank-6 → busiest time.
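One way such a rank assignment could be implemented is sketched below in Python. It orders the hourly CPU usage of one day from lowest to highest and numbers the hours from 0 upward; the input values are the seven rows of Table 2. The function name `assign_ranks` and the tie-breaking rule (the earlier hour receives the lower rank) are assumptions, chosen because they reproduce the Rank column of Table 2; the original program is not described in enough detail to confirm them.

```python
def assign_ranks(usage_by_hour):
    """Rank the hours of one day by CPU usage: 0 = least busy, n-1 = busiest.

    Ties are broken by the hour label, so the earlier hour gets the lower rank
    (an assumption made for this sketch).
    """
    ordered = sorted(usage_by_hour, key=lambda hour: (usage_by_hour[hour], hour))
    return {hour: rank for rank, hour in enumerate(ordered)}


# CPU usage on 040196 (weekday), taken from Table 2.
usage = {
    "09:00": 34.9, "10:00": 31.7, "11:00": 35.2, "12:00": 26.9,
    "13:00": 32.3, "14:00": 37.8, "15:00": 37.8,
}

ranks = assign_ranks(usage)
for hour in sorted(ranks):
    print(hour, ranks[hour])
```

The two hours with equal usage (14:00 and 15:00) illustrate why neighbouring ranks can be hard to tell apart, which is the situation the fuzzy matching degree is meant to soften.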

Notice that a rank value of 6 indicates the busiest time. In the system, the corresponding property is represented by an ISA hierarchy [4]:

{Rank-1, ..., Rank-6} ⊂ Utilization,

where the ranks define the degree of utilization. When we want to introduce a new property into the system, we can add it to the corresponding ISA hierarchy. For example, when we introduce the new property "very busy" into the system, the corresponding properties can be represented as:

{Rank-1, ..., Rank-4, very busy} ⊂ Utilization,
{Rank-5, Rank-6} ⊂ very busy.

That is, the property "very busy" = Rank-5 ∨ Rank-6. In other words, the property "not very busy" = not Rank-5 ∧ not Rank-6. As shown in Fig. 7, the proposed fuzzy linguistic summary includes two phases. Phase 1 serves as a user interface, in which users define the corresponding operations according to the premining results; in phase 2, the summary automatically focuses the search on interesting rules. We now introduce the operations of our summary as follows:

3.2.1. Preclassification

According to the premining results, users can preclassify objects into different groups. Since the data considered in this paper are ordered by some attribute, one of the goals of mining such data is generally to find the relationship between the ranking of the data and that attribute's value. Therefore, for the time series data analysis problem, we can preclassify the data by the time attribute and then use our system to discover the relationship between the ranking of the data and time.

3.2.2. Selecting the candidate properties for each group by a heuristic function

To make the process more efficient, for each group we select a set of candidate properties from the predefined ones by a heuristic function. A property is a candidate if and only if it may be represented as a rule for characterizing that group. In the system, when the heuristic function is undefined, all of the predefined properties are viewed as candidate properties. Basically, there is no unique way to define the heuristic function; therefore, users have to define it according to the type of the predefined properties and the goal of the application.

3.2.3. Computing the truth values

In a real-world application, an object may match a certain property only to some degree; therefore, we use the EQ functions to soften the problem of imprecision and errors. In this paper, the system will compute truth


Fig. 7. The fuzzy linguistic summary.

values of each group with respect to each candidate property in that group by Definition 4.

3.2.4. Generating a new property and setting a parameter

Let $X$ be a set of objects in a group. When all truth values of $X$ with respect to the candidate properties are less than the predefined threshold for the truth value, we merge the two candidate properties that have the highest average matching degrees into a new candidate property, $F$. In this paper, the new candidate property $F$ is represented as a disjunctive property. For example, $F = F_i \vee F_j$ means that most objects in the group satisfy either $F_i$ or $F_j$. To avoid generating overgeneralized rules, users have to define the maximum length of the property according to the application. When a new candidate property is too general, i.e., the length of the new candidate property is larger than the user-defined

length, we simply ignore this new candidate property; otherwise, we go back and compute the truth value with respect to this new candidate property.

3.2.5. Selecting rules

Not all instances in a group may be covered by one rule; therefore, many researchers consider a small number of unusual cases as noise or exception data [4]. To handle these unusual cases, we can attach quantitative information to each rule as a measure of the strength of the rule. Traditionally, in the crisp classification problem, the strength of a rule $p \to q$ is the ratio of the number of objects satisfying both $p$ and $q$ to the number of objects satisfying $p$. In this paper, all objects are preclassified by some attributes, so that the objects in the same group have equal values for those attributes. Consequently, the average matching degree can be viewed as a measure of the strength with which a certain property is held by a group. When the average matching degree of a property for a group is 1, the objects


in this group fully match the property; otherwise, some objects in the group may only approximately match the property. If the truth value of $X$ with respect to $F$ is greater than the predefined threshold for the truth value, then $F$ may be the property for characterizing this group. When more than one property satisfies this condition, we select the candidate property with the maximum average matching degree as the property for characterizing $X$. From the logical point of view, the property $F$ for the group $X$ can be represented as:

$$A_1 = a_{1j} \wedge \cdots \wedge A_k = a_{kj} \rightarrow \text{property} = F \ (\text{average matching degree}),$$

where for $i = 1, \ldots, k$, $a_{ij} \in \mathrm{Dom}(A_i)$, and the average matching degree is the strength of the rule.

4. Experimental results

In this paper, all data are collected hourly from an IBM mainframe in a bank. For a bank in our country, the working hours are from 8:00 to 15:30 on weekdays and from 8:00 to 13:00 on Saturday. Since the data are recorded by the hour, for simplicity we assume that the working time is from 8:00 to 15:00 on weekdays. To demonstrate that our fuzzy linguistic summary works correctly, we consider only an on-line CPU utilization analysis problem in this paper. Furthermore, we use data from April 1996 to August 1996 as training examples, and data from September 1996 to October 1996 as testing examples.

4.1. Premining by the graphic display tool

To predict what exactly could be discovered for the CPU utilization problem, we first enter the date on the input screen as shown in Fig. 3, and press the OK button to generate the on-line CPU utilization diagram, which is shown in Fig. 8. In Fig. 8, the bar chart and line chart represent the average value and the maximum value of the on-line CPU utilization for each month, respectively. Double-clicking the data series marker area causes the chart shown in Fig. 9 to appear, showing the month

grouped by day. At this point, we can double-click each bar again to see the on-line CPU usage on that day by hour. Fig. 10 shows a bar chart of the on-line CPU utilization for hours 1–24 on 1 April 1996. After seeing these diagrams, we conclude from the bar charts that there may be a relationship between the on-line CPU utilization and the time. In other words, the knowledge to be discovered in this case is the relationship between the ranking of the on-line CPU utilization and time.

4.2. Data mining by the fuzzy linguistic summary

To concentrate on presenting the fuzzy linguistic summary function in the data mining agent of the system, we assume that all relevant attributes and data have been extracted according to the corresponding domain knowledge. We now discuss how to summarize the on-line CPU utilization data into compact information by the proposed fuzzy linguistic summary. Since the implementation of each step is straightforward, we only show the result of each step in the following discussion.

4.2.1. Phase 1 in the fuzzy linguistic summary

4.2.1.1. Preclassification. Since the behavior during working hours on weekdays may differ from that on Saturday, we preclassify tuples in the database by the "time" and "date-type" attribute values, so that the tuples are divided into 12 groups. As shown in Table 3, each group has the same values for the attributes "time" and "date-type", where Dom(date-type) = {weekday, Saturday, Holiday}.

4.2.1.2. Defining the heuristic function and setting a parameter. In this case, the heuristic rule can be defined as follows. Let $X$ be a set of objects in a group, and $F_1, \ldots, F_m$ be the predefined properties for $X$. Assume that most objects in $X$ have the property $F_i$. Then

$$H = \frac{\text{number of tuples having property } F_i \text{ in } X}{\text{number of tuples in } X};$$

when $H \ge \alpha$, $F_i$ is the only candidate property for the group $X$; otherwise, $F_{i-1}$, $F_i$ and $F_{i+1}$ are the candidate properties for $X$, where $\alpha$ is a pre-specified

Fig. 8. CPU utilization from April 1996 to August 1996 (grouped by month).

Fig. 9. CPU utilization in April 1996 (grouped by day).

Fig. 10. CPU utilization on 1 April 1996 (grouped by hour).

Table 3
The results of preclassification

Group     Date-type   Time    Rank   μ_cpu(t[CPU])
Group 1   Weekday     09:00   3      0.8524
          ⋮           ⋮       ⋮      ⋮
          Weekday     09:00   4      0.8466
⋮
Group 7   Weekday     15:00   6      1
          ⋮           ⋮       ⋮      ⋮
          Weekday     15:00   5      0.9890
⋮

threshold value for the EQ function in this paper. The advantage of using the heuristic function is that we have no more than three candidate properties for each group in this case. To avoid getting a useless rule, we simply ignore any property whose length is greater than two. In other words, the parameter is two in this case.

4.2.2. Phase 2 in the fuzzy linguistic summary

As shown in Table 2, the value $\mu_{cpu}(t[\mathrm{CPU}])$ is the degree of the on-line CPU utilization at a specific working hour on that day, and the membership function $\mu_{cpu}$ is given as:

$$\mu_{cpu}(x) = (x/y)^2,$$

where $x$ is the on-line CPU utilization at a specific working hour on one day, and $y$ is the highest on-line CPU utilization during working hours on the same day. Since the data are recorded by the hour, the on-line CPU utilization at different hours may be very close or even equal to each other. For example, as shown in Table 2, the CPU utilization at 14:00 on 1 April is equal to that at 15:00 on 1 April. Therefore, we assume that the threshold values for the EQ function and the truth value are 0.85 and 1, respectively. Accordingly, the matching degree of $t_i$ with respect to $F_j$ can be computed by the following function $\mu_{F_j}(t_i)$:

$$\mu_{F_j}(t_i) = \mu_{EQ}(t_i, t_j),$$

$$\mu_{EQ}(t_i, t_j) = \begin{cases} 0 & \text{for } 1 - |\mu_{cpu}(t_i) - \mu_{cpu}(t_j)| < 0.85, \\ 1 - |\mu_{cpu}(t_i) - \mu_{cpu}(t_j)| & \text{for } 1 - |\mu_{cpu}(t_i) - \mu_{cpu}(t_j)| \ge 0.85, \end{cases}$$

where $F_j$ is the rank value of the on-line CPU utilization at a specific hour, and $t_j$ is the tuple of the on-line CPU utilization with ranking $F_j$ on the same day as $t_i$. By the semantics of the candidate properties, the truth value of "most $X$ are $F$" can be computed by Definition 4. For example, Fig. 11 shows that the candidate properties of the data grouped by 15:00 and weekday are 4, 5, and 6.

4.2.2.1. Generating a new property and selecting rules. Since the results in Fig. 11 do not satisfy the threshold for the truth value, a new candidate property, 5 ∨ 6, is generated by the system for this group, as shown in the same figure. Finally, the resulting rules are given in Fig. 12. According to these rules, the peak time during working hours is from 13:00 to 15:00, and the least busy time is from 11:00 to 12:00 on weekdays. Therefore, we can tune the system performance by moving unimportant jobs from 13:00–15:00 to 11:00–12:00 on weekdays. On the other hand, the on-line CPU utilization is very busy from 12:00 to 13:00 on Saturday.

4.3. Testing results

After testing the data from September to October, as shown in Fig. 13, we achieved at least 85% accuracy in predicting the ranking of the on-line CPU utilization during working hours on weekdays. In this case, when higher accuracy is expected, we can raise the threshold value of the EQ function.
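The Phase 2 computation of Section 4.2.2 can be sketched in Python for the single day listed in Table 2. The sketch derives $\mu_{cpu}$ from the raw CPU usage, computes the matching degree of every hour to the tuples ranked 5 and 6 with the EQ threshold 0.85, and combines them into the disjunctive property 5 ∨ 6 by taking the maximum, as in Definition 4. Function and variable names are assumptions of this illustration; a real group would average such degrees over all days in the training data, which are not reproduced here.

```python
EQ_THRESHOLD = 0.85  # threshold for the EQ function used in Section 4.2.2


def mu_cpu(usage, day_peak):
    """Degree of CPU utilization: (x / y)^2 with y the day's highest usage."""
    return (usage / day_peak) ** 2


def matching_degree(mu_i, mu_j, threshold=EQ_THRESHOLD):
    """mu_EQ(t_i, t_j): 1 - |mu_cpu(t_i) - mu_cpu(t_j)|, or 0 below the threshold."""
    degree = 1.0 - abs(mu_i - mu_j)
    return degree if degree >= threshold else 0.0


# (time, CPU usage, rank) for 040196, taken from Table 2.
rows = [
    ("09:00", 34.9, 3), ("10:00", 31.7, 1), ("11:00", 35.2, 4), ("12:00", 26.9, 0),
    ("13:00", 32.3, 2), ("14:00", 37.8, 5), ("15:00", 37.8, 6),
]

peak = max(usage for _, usage, _ in rows)
mu = {time: mu_cpu(usage, peak) for time, usage, _ in rows}
mu_of_rank = {rank: mu[time] for time, _, rank in rows}

for time, _, _ in rows:
    to_rank5 = matching_degree(mu[time], mu_of_rank[5])
    to_rank6 = matching_degree(mu[time], mu_of_rank[6])
    rank5_or_6 = max(to_rank5, to_rank6)  # disjunctive property 5 v 6
    print(time, round(mu[time], 4), round(rank5_or_6, 4))
```

For this day the computed $\mu_{cpu}$ values agree with the last column of Table 2 to four decimal places, and only the hours whose degree lies within 0.15 of the peak keep a non-zero matching degree.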

5. Conclusion and future research

In the proposed system, the graphic display tool can assist users in premining the historical data or checking the current situation of the system. The fuzzy linguistic


Fig. 11. The candidate properties for the data grouped by 9:00 and weekday.

Fig. 12. The resulting rules.

Fig. 13. The testing results.

summary can be used to predict the utilization ranks of different resources at specific moments. Even without the graphic display tool, the proposed system can handle fuzzy series from the beginning. Currently, most parameters take default values and users cannot insert rules into the system; therefore, we plan to develop a more flexible interface so that users can define different parameters according to the problem statement and insert rules directly into the system. Consequently, the system can apply the rules supplied by the user. Moreover, we do not consider missing data in this paper. Currently, when missing data appear, we simply


ignore the data of that day. In the future, we will address the missing data problem in our system. To tune the system performance more precisely, we are going to develop more data mining functions. For example, since the system performance depends on several resources at the same time, we plan to use (fuzzy) correlation to analyze the performance data so that system programmers can determine more precisely why the system performance has degraded. In the current system, we can find rules from the raw data, and system programmers can tune MVS by moving less important jobs from peak time to off-peak time. Since the rules can be represented as program statements such as IF/THEN statements, we plan to integrate these rules into the system directly. Therefore, by incorporating other domain information such as job priority, the system can tune itself automatically by moving less important jobs from peak time to off-peak time. Moreover, the rules in this paper could be further simplified. In the example, rule 6 and rule 7 can be generalized as:

Time = 13:00–15:00 → Rank = 5 ∨ 6 (0.9350).

Therefore, we will develop a simplification procedure to generalize these rules in the future.

References

[1] P. Bosc, M. Galibourg, G. Hamon, Fuzzy query with SQL: extensions and implementation aspects, Fuzzy Sets and Systems 28 (1988) 333–349.
[2] P. Bosc, O. Pivert, SQLf: A relational database language for fuzzy querying, IEEE Trans. Fuzzy Systems 3 (1) (1995) 1–17.
[3] B.P. Buckles, F.E. Petry, Information-theoretical characterization of fuzzy relational databases, IEEE Trans. Syst., Man Cybern. 13 (1) (1983) 72–77.
[4] Y. Cai, N. Cercone, J. Han, Attribute-oriented induction in relational databases, in: G. Piatetsky-Shapiro, W.J. Frawley (Eds.), Knowledge Discovery in Databases, AAAI/MIT, Cambridge, MA, 1991, pp. 213–227.
[5] M.-S. Chen, J. Han, P.S. Yu, Data mining: an overview from a database perspective, IEEE Trans. Knowl. Data Eng. 8 (6) (1996) 866–883.
[6] D.A. Chiang, L.R. Chow, N.C. Hsien, Fuzzy information in extended fuzzy relational databases, Fuzzy Sets and Systems, accepted for publication.
[7] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, The KDD process for extracting useful knowledge from volumes of data, ACM Comm. 39 (11) (1996) 27–34.
[8] J. Kacprzyk, C. Iwanski, Fuzzy logic with linguistic quantifiers in inductive learning, in: Fuzzy Logic for the Management of Uncertainty, pp. 465–478.
[9] J. Kacprzyk, S. Zadrozny, A. Ziolkowski, FQUERY III+: a human-consistent databases querying system based on fuzzy logic with fuzzy linguistic quantifiers, Inform. Systems 14 (6) (1989) 443–453.
[10] J. Kacprzyk, A. Ziolkowski, Databases queries with fuzzy linguistic quantifier, IEEE Trans. Syst., Man Cybern. 16 (1986) 474–479.
[11] C.J. Matheus, P.K. Chan, G. Piatetsky-Shapiro, System for knowledge discovery in databases, IEEE Trans. Knowl. Data Eng. 5 (6) (1993) 903–913.
[12] J.M. Medina, M.A. Vila, J.C. Cubero, O. Pons, Towards the implementation of a generalized fuzzy relational database model, Fuzzy Sets and Systems 75 (1995) 273–289.
[13] J.M. Medina, M.A. Vila, O. Pons, GEFRED: a generalized model of fuzzy relational databases, Inform. Sci. 76 (1994) 87–109.
[14] K.V.S.V.N. Raju, A.K. Majumdar, Fuzzy functional dependencies and lossless join decomposition on fuzzy relational database systems, ACM TODS 13 (1988) 129–166.
[15] V. Tahani, A conceptual framework for fuzzy query processing: a step towards very intelligent database systems, Inform. Process. Management 13 (1977) 289–303.
[16] R.R. Yager, General multiple-objective decision functions and linguistically quantified statements, Int. J. Man–Mach. Stud. 21 (1984) 389–400.
[17] R.R. Yager, On ordering weighted averaging aggregation operations in multicriteria decisionmaking, IEEE Trans. Syst., Man Cybern. 18 (1988) 183–190.
[18] R.R. Yager, On linguistic summaries of data, in: G. Piatetsky-Shapiro, W.J. Frawley (Eds.), Knowledge Discovery in Databases, AAAI/MIT, Cambridge, MA, 1991, pp. 347–362.
[19] R.R. Yager, Database discovery using fuzzy sets, Int. J. Intelligent System 11 (1996) 691–712.
[20] L.A. Zadeh, Fuzzy sets, Inf. and Control 8 (1965) 338–353.
[21] L.A. Zadeh, Fuzzy sets as a basis for theory of possibility, Fuzzy Sets and Systems 1 (1) (1978) 3–28.
[22] L.A. Zadeh, A computational approach to fuzzy quantifiers in natural languages, Comput. Math. Appl. 9 (1984) 149–184.
