Journal of Systems Engineering and Electronics Vol. 19, No. 3, 2008, pp.606–610
Novel method for the evaluation of data quality based on fuzzy control∗
Ban Xiaojuan1, Ning Shurong1, Xu Zhaolin1 & Cheng Peng2
1. School of Information Engineering, Univ. of Science & Technology Beijing, Beijing 100083, P. R. China;
2. Beijing Inst. of Electronic System Engineering, Beijing 100854, P. R. China
(Received August 18, 2006)
Abstract: One goal of data collection is to support decision-making, so the data must meet high quality requirements. Rational evaluation of data quality is an effective way to identify data problems in time, so that the data that pass the evaluation satisfy the requirements of the decision maker. A data quality evaluation method based on a fuzzy neural network is proposed. First, criteria for the evaluation of data quality are selected to construct the fuzzy sets of evaluation grades; then the learning ability of a neural network (NN) is used to determine the memberships objectively, which allows data quality to be evaluated effectively. The method has been applied in the platform of ‘data report of national compulsory education outlay guarantee’ of the Chinese Ministry of Education. The method can be used for the effective evaluation of data quality worldwide, and with it the data quality situation can be determined more completely, more objectively, and in a more timely manner.
Keywords: data quality, evaluation system, fuzzy control theory, neural network.
1. Introduction
Software and data cannot be separated; they form a whole. The cost and failure risk of software can be reduced enormously by detecting and correcting data problems early while the software is running. Therefore, it is very important to develop an effective, scientific data quality evaluation system that guarantees maximal data quality at minimal cost. There are many criteria for the evaluation of data, but the majority of them do not work properly, which significantly affects the accuracy of data quality evaluation. The degree to which data quality characteristics are satisfied is fuzzy, and the weights of these quality properties should differ between evaluations according to the application field and intention[4]. If the evaluation relies mainly on the weights of the data properties, a certain subjectivity is unavoidable. Therefore, this research takes as its background a project of the Ministry of Education, the platform of ‘data report of national compulsory education outlay guarantee’, on which data are filled in by every section and then audited, gathered, and reported to the leadership, and proposes a data quality evaluation method for this platform. The method integrates the structured knowledge representation of fuzzy logic inference with the self-learning ability of a neural network (NN): the fuzzy properties in the evaluation of data quality are expressed by fuzzy logic inference, and the memberships and weight values of the quality characteristics are determined by the NN. This avoids subjectivity in determining the weight values of the quality characteristics and makes the evaluation more rational and objective.
2. Evaluation of data quality
2.1 Definition of data quality
Poor data quality usually leads people to conclude that the data must be wrong and unusable. However, the data quality problem is too complicated for data to be classified simply as right or wrong. In brief, the only criterion for the evaluation of data quality is the capability to satisfy the application requirement.
* This project was supported by the National Natural Science Foundation of China (60503024; 50634010).
2.2 Evaluation criteria for data quality
Data quality cannot be judged by considering only one or two properties; all data properties should be taken into consideration. Here, five properties are selected as the criteria of the integrated evaluation, according to the quality measurement model proposed by ISO and the structural characteristics of the software.
(1) Data in time: whether the time at which the data are reported satisfies the requirement.
(2) Data integrality: whether the filled-in data are complete.
(3) Data coherence: whether the reported data units are unified and the corresponding relationships are in accordance with the system requirement.
(4) Data veracity: whether the filled-in data are in accordance with the facts.
(5) Data uniqueness: whether the reported data satisfy the requirement that each datum be reported only once.
3. Data quality evaluation system based on fuzzy control theory
It is very difficult to evaluate data quality with a quantitative method because every factor in the evaluation is fuzzy. We therefore introduce fuzzy control theory for the effective evaluation of data quality. The concepts of fuzzy sets and fuzzy algorithms were presented in the 1960s by L. A. Zadeh, a famous professor in the computer science department of the University of California, USA, in “Fuzzy Sets”, “Fuzzy Algorithm”, and other works. Fuzzy comprehensive judgment considers every criterion of data quality comprehensively, constructs a comprehensive judgment model, and evaluates data quality using fuzzy set theory.
3.1 The model of evaluation
The core of fuzzy control theory is the calculation of membership. An artificial neural network (ANN) has strong self-learning and association capabilities, with low sensitivity to interference and high precision. These strengths can be used to solve the problem of calculating membership, so an ANN model is selected for this purpose. The ANN model is described as follows. A 4-level back-propagation neural network (BPNN) is adopted in this article. The number of neural cells at the input level equals the number of elements in the system, and the number at the output level equals the number of designed functions; the number of neural cells at the middle level, which comprises both the condition and rule levels, is determined by experience. The network structure is given in Fig. 1.
Fig. 1 Structure of fuzzy ANN
We adopt the network structure shown in Fig. 1 to accomplish input fuzzification, fuzzy operation, and output. We choose the Gaussian function as the membership function:
$$\mathrm{Gaussian}(x, \sigma, c) = e^{-\frac{1}{2}\left(\frac{x-c}{\sigma}\right)^{2}} \qquad (1)$$
The Gaussian membership function is determined by $c$ and $\sigma$, where $c$ denotes the center of the membership function and $\sigma$ denotes its width. The weight values of the quality characteristics differ between judgments because the applications differ, so the weight vector is expressed as $W = (w_1, w_2, \ldots, w_k)$, where $w_k$ is the weight of judgment criterion $k$, $w_1 + w_2 + \ldots + w_k = 1$, and $w_k \in [0, 1]$. The judgment vector $e$ ($e = W \circ R$) is obtained by the compound operation, and the grade corresponding to the maximal membership in $e$ is taken as the quality evaluation grade.
(1) The first level: input level. The network input $X^T = [x_1, x_2, \ldots, x_n]$ consists of the scores that manual auditing gives for every criterion, i.e. $X^T$ = [in time score, integrality score, coherence score, veracity score, uniqueness score], with $x_n \in [0, 100]$ and $n = 5$; that is, each criterion score is expressed by a number ranging from 0 to 100.
(2) The second level: condition level. Each node denotes one fuzzy language variable, i.e. excellent, good, middling, bad, or awful. The output is the membership, which represents the degree to which the input belongs to the language variable. The fuzzy parameters, i.e. $c$ and $\sigma$ in the Gaussian function, are determined by the connecting weights between the first and second levels.
(3) The third level: rule level. There are five neural cells, each of which also expresses one fuzzy language variable, i.e. excellent, good, middling, bad, or awful. The weighted sum is computed to obtain the judgment vector $e$. The input from the second level to the current level is $l_k$ $(k = 1, 2, \ldots, 5)$; when a new input value is set, $l$ is advanced to $l + 1$. The output of this level is $Z_j$ $(j = 1, 2, \ldots, 5)$, $w^{(1)}_{ij}$ denotes the connecting weight from neural cell $i$ in the input level to neural cell $j$ in the middle level, and $f(\cdot)$ denotes the activation function. The output of the middle level is
$$Z_j = f(net_j), \quad j = 1, 2, \ldots, p \qquad (2)$$
where
$$net_j = \sum_{i=1}^{M} w^{(1)}_{ij} x_i, \quad j = 1, 2, \ldots, p \qquad (3)$$
$$w^{(1)}_{i,j}(l+1) = w^{(1)}_{i,j}(l) + \eta \Delta w^{(1)}_{i,j} \qquad (4)$$
(4) The fourth level: output level. The grade corresponding to the maximal membership in vector $e$ is determined to obtain the data quality judgment rank. The output of the output level is
$$y_k = f(net_k), \quad k = 1, 2, \ldots, N + 1 \qquad (5)$$
where
$$net_k = \sum_{j=1}^{P} Z_j w^{(2)}_{j,k}, \quad k = 1, 2, \ldots, N \qquad (6)$$
The weight $w^{(2)}_{j,k}$ denotes the connection weight from neural cell $j$ in the middle level to neural cell $k$ in the output level, and $x_i$ is the input of the $i$th neural cell.
$$w^{(2)}_{j,k}(l+1) = w^{(2)}_{j,k}(l) + \eta \Delta w^{(2)}_{j,k} \qquad (7)$$
$$\Delta w^{(2)}_{j,k}(l) = \frac{\partial E}{\partial w^{(2)}_{j,k}} = f(net_k) \sum_{j=1}^{P} f\left(\sum_{i=1}^{M} w^{(2)}_{j,k} x_i\right) \qquad (8)$$
$$\Delta w^{(1)}_{i,j}(l) = \frac{\partial E}{\partial w^{(1)}_{i,j}} = f(net_j) \sum_{j=1}^{P} w^{(2)}_{j,k} f\left(\sum_{i=1}^{M} w^{(1)}_{ij} x_i\right) \cdot \sum_{i=1}^{M} x_i \qquad (9)$$
3.2 The result of training
The training data are obtained from the platform of “data report of national compulsory education outlay guarantee” and are shown in Table 1. The studied factors are data in time, data integrality, data coherence, data veracity, and data uniqueness.
Table 1 Training sample data
In Time   Integrality   Coherence   Veracity   Uniqueness   Grade
80        83            86          87         84           good
92        83            83          95         74           excellent
90        82            85          97         84           excellent
70        93            86          88         74           good
56        73            96          86         94           middling
80        93            76          87         84           good
80        33            66          56         84           bad
70        73            36          67         94           middling
49        53            46          47         64           bad
80        83            84          67         54           middling
84        83            86          87         84           middling
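As an illustration of how the samples in Table 1 could drive the weight updates of Section 3.1, the following is a minimal sketch rather than the implementation used on the platform. It assumes a single middle level, a sigmoid activation f, a squared error E, constant bias inputs, and the standard descent step (the weight change is the negative gradient scaled by η). The names `train` and `predict`, the number of middle cells, η, and the epoch count are all hypothetical. Scores are scaled to [0, 1] and grades are one-hot encoded.

```python
import math
import random

GRADES = ["excellent", "good", "middling", "bad", "awful"]

# Rows of Table 1: (in time, integrality, coherence, veracity, uniqueness), grade
SAMPLES = [
    ((80, 83, 86, 87, 84), "good"),
    ((92, 83, 83, 95, 74), "excellent"),
    ((90, 82, 85, 97, 84), "excellent"),
    ((70, 93, 86, 88, 74), "good"),
    ((56, 73, 96, 86, 94), "middling"),
    ((80, 93, 76, 87, 84), "good"),
    ((80, 33, 66, 56, 84), "bad"),
    ((70, 73, 36, 67, 94), "middling"),
    ((49, 53, 46, 47, 64), "bad"),
    ((80, 83, 84, 67, 54), "middling"),
    ((84, 83, 86, 87, 84), "middling"),
]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def encode(scores):
    """Scale scores to [0, 1] and append a constant bias input (assumption)."""
    return [s / 100.0 for s in scores] + [1.0]


def forward(w1, w2, x):
    """Return middle-level outputs Z (plus a bias unit) and output values y."""
    z = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1] + [1.0]
    y = [sigmoid(sum(w * zj for w, zj in zip(row, z))) for row in w2]
    return z, y


def train(samples, hidden=6, eta=0.5, epochs=300, seed=0):
    rng = random.Random(seed)
    n_out = len(GRADES)
    # w1: input (5 scores + bias) -> middle level; w2: middle (+ bias) -> output
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)] for _ in range(n_out)]
    data = [(encode(x), [float(g == grade) for g in GRADES]) for x, grade in samples]
    history = []  # summed squared error per epoch
    for _ in range(epochs):
        total = 0.0
        for x, t in data:
            z, y = forward(w1, w2, x)
            total += 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))
            # Error signals of the output level and of the middle level
            d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, t)]
            d_hid = [
                z[j] * (1.0 - z[j]) * sum(d_out[k] * w2[k][j] for k in range(n_out))
                for j in range(hidden)
            ]
            # Descent step: w(l + 1) = w(l) - eta * dE/dw
            for k, row in enumerate(w2):
                for j in range(hidden + 1):
                    row[j] -= eta * d_out[k] * z[j]
            for j, row in enumerate(w1):
                for i in range(len(x)):
                    row[i] -= eta * d_hid[j] * x[i]
        history.append(total)
    return w1, w2, history


def predict(w1, w2, scores):
    """Grade with the largest output value (maximal membership principle)."""
    _, y = forward(w1, w2, encode(scores))
    return GRADES[max(range(len(GRADES)), key=lambda k: y[k])]


w1, w2, history = train(SAMPLES)
print(f"error: first epoch {history[0]:.3f}, last epoch {history[-1]:.3f}")
```

With only eleven samples, two of which have almost identical scores but different grades, such a network should be read as a demonstration of the update rule, not as a validated classifier.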
The membership results can be obtained accurately from the training sample data.
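To make the forward path of Section 3.1 concrete, the sketch below fuzzifies the five criterion scores with the Gaussian membership function of Eq. (1), forms the judgment vector e as a weighted sum, and picks the grade with the maximal membership. In the method itself c and σ are learned by the network; here the centers (90, 80, 70, 50, 30), the common width σ = 10, and the reuse of the Section 4 weight vector as W are assumptions made only for illustration, and the function names are hypothetical.

```python
import math

GRADES = ["excellent", "good", "middling", "bad", "awful"]
CENTERS = [90.0, 80.0, 70.0, 50.0, 30.0]  # assumed centers c, one per grade
SIGMA = 10.0                               # assumed common width sigma
W = [0.20, 0.15, 0.15, 0.35, 0.15]         # criterion weights, borrowed from Section 4


def gaussian(x, sigma, c):
    """Gaussian membership function of Eq. (1)."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)


def fuzzify(score):
    """Condition level: membership of one score in each of the five grades."""
    return [gaussian(score, SIGMA, c) for c in CENTERS]


def judge(scores):
    """Rule and output levels: weighted sum e, then maximal membership."""
    r = [fuzzify(s) for s in scores]  # one membership row per criterion
    e = [sum(w * row[g] for w, row in zip(W, r)) for g in range(len(GRADES))]
    best = max(range(len(GRADES)), key=lambda g: e[g])
    return GRADES[best], e


grade, e = judge([92, 83, 83, 95, 74])
print(grade, [round(v, 3) for v in e])
```

Replacing `CENTERS` and `SIGMA` with learned values would change the memberships but not the structure of the computation.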
4. Application example
The data used in this study are obtained from the national compulsory education outlay guarantee platform, and the evaluation process is as follows.
(1) Determination of comment set
V = {v1, v2, v3, v4, v5} = {excellent, good, middling, bad, awful}
(2) Determination of factor set
U = {u1, u2, u3, u4, u5} = {In time, Integrality, Coherence, Veracity, Uniqueness}
(3) Determination of the weight of judgment criteria
A = (0.20, 0.15, 0.15, 0.35, 0.15)
(4) Determination of membership
• In Time
Table 2 shows the membership of the quality criteria according to the difference value between reporting date and deadline shown in Fig. 2.
Table 2 In time quality criteria membership
Difference value between
reporting date and deadline   excellent   good   middling   bad   awful
(−100, −20]                   1           0      0          0     0
(−20, −10]                    0           1      0          0     0
(−10, 0]                      0           0      1          0     0
(0, 20]                       0           0      0          1     0
(20, 100]                     0           0      0          0     1
Fig. 2 In time quality criteria
• Integrality
(a) Count the total number N of items that must be filled in and the number n of items that should have been filled in but have not been.
(b) Normalization: calculate the proportion, i.e. the rate p = n/N.
(c) Fuzzification: divide the rate into the intervals [0, 5%], (5%, 10%], (10%, 30%], (30%, 70%], (70%, 100%]. The integrality membership is shown in Table 3.
Table 3 Integrality quality criteria membership
rate           excellent   good   middling   bad   awful
[0, 5%]        0.9         0.8    0.5        0     0
(5%, 10%]      0.2         0.9    0.7        0.3   0
(10%, 30%]     0           0.3    0.9        0.5   0
(30%, 70%]     0           0      0          0.9   0.3
(70%, 100%]    0           0      0          0     1
• Coherence
The coherence auditing result is given by each auditing module operator at County-level as one of {excellent, good, middling, bad, awful}. Because operators differ in their judgment, the membership function must be calculated from the judgment results of every County-level operator. Here we compute the membership by the NN method, and the coherence membership is shown in Table 4.
Table 4 Coherence quality criteria membership
County-level
auditing result   excellent   good   middling   bad   awful
excellent         0.9         0.8    0          0     0
good              0.2         0.9    0.8        0     0
middling          0           0.3    0.9        0.5   0
bad               0           0      0.2        0.9   0.3
awful             0           0      0          0.3   0.9
• Veracity
The veracity auditing result is given by the auditing module operator at County-level as one of {excellent, good, middling, bad, awful}. Because operators differ in their judgment, the membership function must be calculated from the judgment results of every County-level operator. Here we compute the membership by the NN method, and the veracity membership is shown in Table 5.
Table 5 Veracity quality criteria membership
County-level
auditing result   excellent   good   middling   bad   awful
excellent         0.8         0.7    0          0     0
good              0.1         0.8    0.6        0     0
middling          0           0.1    0.8        0.6   0
bad               0           0      0.1        0.8   0.6
awful             0           0      0          0.3   0.8
• Uniqueness
The quality rank is automatically degraded to one if the data appear in repeated report forms.
(5) Fuzzy comprehensive evaluation
• With the corresponding evaluation matrix R and the weight vector A, we compute the comprehensive evaluation matrix B = A ∗ R, which is normalized to obtain the normalized matrix C.
• Calculation of the fuzzy comprehensive evaluation result D
The data quality of the report forms is determined according to the maximal membership principle. Moreover, if “excellent”, “good”, “middling”, “bad”, and “awful” are represented by the numbers 90, 80, 70, 50, and 30, a quantitative result is obtained, that is, $D = C \cdot [90\ 80\ 70\ 50\ 30]^T$.
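The interval lookups behind Tables 2 and 3 can be sketched as follows. This is an illustrative sketch, not the platform's code: the function names are hypothetical, the intervals are treated as right-closed as printed, and the last integrality interval is taken to be (70%, 100%] as listed in step (c).

```python
from bisect import bisect_left

# Upper bounds of the right-closed intervals of Table 2 and their membership rows
IN_TIME_UPPER = [-20, -10, 0, 20, 100]
IN_TIME_ROWS = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

# Upper bounds of the rate intervals of Table 3 and their membership rows
RATE_UPPER = [0.05, 0.10, 0.30, 0.70, 1.00]
RATE_ROWS = [
    [0.9, 0.8, 0.5, 0, 0],
    [0.2, 0.9, 0.7, 0.3, 0],
    [0, 0.3, 0.9, 0.5, 0],
    [0, 0, 0, 0.9, 0.3],
    [0, 0, 0, 0, 1],
]


def in_time_membership(diff):
    """Membership row for the difference between reporting date and deadline."""
    if not -100 < diff <= 100:
        raise ValueError("difference value outside (-100, 100]")
    # bisect_left returns the first interval whose upper bound is >= diff
    return IN_TIME_ROWS[bisect_left(IN_TIME_UPPER, diff)]


def integrality_membership(missing, required):
    """Membership row for the rate p = n / N of unfilled required items."""
    if required <= 0 or not 0 <= missing <= required:
        raise ValueError("need 0 <= missing <= required and required > 0")
    rate = missing / required
    return RATE_ROWS[bisect_left(RATE_UPPER, rate)]


print(in_time_membership(-15))
print(integrality_membership(2, 60))
```

Using `bisect_left` on the upper bounds keeps each boundary value in the lower interval, which matches the right-closed notation of the tables.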
5. Examination of the results
The data in table A are used to illustrate the whole process. Table A gives:
Difference value between reporting date and deadline: −15
Items that must be filled: 60
Items that must be filled but not yet: 2
County-level auditing coherence result: good
County-level auditing veracity result: excellent
No repeated data report
According to the maximal membership principle, the data quality of the report forms is ‘good’. The quantitative result is
$$D = C \cdot \begin{bmatrix} 90 \\ 80 \\ 70 \\ 50 \\ 30 \end{bmatrix} = 82.7$$
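The worked example can be retraced numerically. The sketch below is an illustration under stated assumptions: A ∗ R is read as an ordinary weighted sum (vector-matrix product), C is B divided by the sum of its entries, and, since no membership row is given for uniqueness, a report without repeats is assumed to map to (1, 0, 0, 0, 0). Under these assumptions the result rounds to the reported 82.7 and the maximal membership falls on ‘good’.

```python
GRADES = ["excellent", "good", "middling", "bad", "awful"]
SCORES = [90, 80, 70, 50, 30]            # numeric value assigned to each grade
A = [0.20, 0.15, 0.15, 0.35, 0.15]       # weights of the five criteria

# Evaluation matrix R: one membership row per criterion
R = [
    [0, 1, 0, 0, 0],        # in time: difference -15 lies in (-20, -10], Table 2
    [0.9, 0.8, 0.5, 0, 0],  # integrality: 2/60 = 3.3% lies in [0, 5%], Table 3
    [0.2, 0.9, 0.8, 0, 0],  # coherence: County-level result "good", Table 4
    [0.8, 0.7, 0, 0, 0],    # veracity: County-level result "excellent", Table 5
    [1, 0, 0, 0, 0],        # uniqueness: ASSUMED row for "no repeated report"
]

# B = A * R as a weighted sum over the criteria
B = [sum(a * row[j] for a, row in zip(A, R)) for j in range(len(GRADES))]

# C: B normalized so that its entries sum to 1
total = sum(B)
C = [b / total for b in B]

# Maximal membership principle and quantitative result D = C . SCORES
grade = GRADES[max(range(len(GRADES)), key=lambda j: C[j])]
D = sum(c * s for c, s in zip(C, SCORES))

print(grade, round(D, 1))
```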
6. Conclusions and future prospects
This article combines fuzzy control theory and NN for the evaluation of data quality, which solves the traditional problem that data errors can only be judged manually. Every criterion is considered in the evaluation of data quality. In addition, manual auditing and intelligent judgment are combined effectively to monitor data quality in real time, which establishes a strong basis for detecting data problems ahead of time. Taking as background the Ministry of Education project, the platform of “data report of national compulsory education outlay guarantee”, the method evaluates the data quality from every county effectively, and the data quality situation can be determined more completely, more objectively, and in a more timely manner.
References
[1] Tao Pin, Zhang Bo, Ye Zhen. An incremental bicovering learning algorithm for constructive neural network. Journal of Software, 2003, 14(2): 194–201.
[2] Hu Baoqing. Foundation of fuzzy theory. Wuhan: Wuhan University Press, 2004.
[3] Aebi D, Perrochon L. Towards improving data quality. In: Sarda, N. L., ed. Proceedings of the International Conference on Information Systems and Management of Data. Delhi, 1993. 273–281.
[4] Qiu Xiaoying, Chen Xuesong, Zheng Guoqin. The fuzziness of software quality of ERP. Computer Engineering, 2006, 32(5): 81–85.
Ban Xiaojuan was born in 1970. She is an associate professor, doctor. Her research interests include artificial intelligence, artificial life, and their applications in computer animation. E-mail: [email protected]
Ning Shurong was born in 1976. She is a doctor. Her research interests include artificial intelligence, artificial life, and their applications in computer animation.
Xu Zhaolin was born in 1976. He is a master. His research interests include artificial intelligence and intelligent management.
Cheng Peng was born in 1974. He is a senior engineer. His current research interests include machine vision and pattern recognition.