Computers Ops Res. Vol. 23, No. 3, pp. 263-273, 1996. Copyright © 1996 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0305-0548/96 $15.00 + 0.00. 0305-0548(95)00020-8
ON SOLVING THE CONTINUOUS DATA EDITING PROBLEM
Cliff T. Ragsdale¹†‡ and Patrick G. McKeown²§

¹Department of Management Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061-0235 and ²Department of Management, University of Georgia, Athens, GA 30602, U.S.A.

(Received January 1994; in revised form March 1995)
Scope and Purpose--It is well-known and readily accepted that many computerized data bases contain errors. The task of identifying records containing errors and the specific fields causing these errors is known as the data editing problem. Numerous mathematical programming (MP) techniques have been proposed for solving this problem. Under certain conditions, these MP approaches require excessive amounts of computational effort and have caused many researchers to opt for other heuristic solution techniques when trying to solve particularly difficult data editing problems. In this paper, we present a refined MP procedure that offers new hope for solving difficult data editing problems to optimality.

Abstract--The data editing problem is concerned with identifying the most likely source of errors in computerized data bases. Given a record that is known to fail one or more logical consistency edits, the objective is to determine the minimum (possibly weighted) number of fields that could be changed in order to correct the record. While this problem can easily be formulated as a pure fixed-charge problem, it can be extremely difficult to solve under certain data conditions. In this paper we show how a number of structural characteristics in this problem can be exploited to dramatically reduce the computational time required to solve particularly difficult data editing problems.
1. INTRODUCTION

Over the past decade, the world has witnessed an explosion in the number of computerized data bases being used in the public and private sectors. While these data bases provide many benefits, the potential consequences of errors in these data bases are significant and alarming. Indeed, an erroneous record in a computerized data base led to a law suit recently argued before the U.S. Supreme Court. In discussing this case, Justice Ruth Bader Ginsburg remarked, "We're getting to the 21st century. This is what will be a major problem--computers that have misinformation" [1].

In the past, data entry mistakes by human keypunch operators were probably the most common source of "misinformation" or errors in data bases. While automated data entry devices (such as bar code scanners and optical character recognition) have reduced the amount of manual data entry, this technology is not flawless. Similar limitations in the technology associated with automated voice recognition, handwriting recognition, and machine vision are likely to prevent future data bases from being free of input errors. And as long as humans are responsible for maintaining and updating computerized data bases, we can be sure that they will not be error-free.

†Cliff T. Ragsdale is an Assistant Professor of Management Science at Virginia Polytechnic Institute and State University. He received his B.A. and M.B.A. degrees from the University of Central Florida and holds a Ph.D. in Management Science and Information Technology from the University of Georgia. Dr Ragsdale's primary research interests are in the areas of applied statistics, optimization, and artificial intelligence. His research has appeared in Decision Sciences, Naval Research Logistics, Computers & Operations Research, OMEGA, Operational Research Letters, Financial Services Review, and other publications. He is also the author of the book Spreadsheet Modeling and Decision Analysis: A Practical Introduction to Management Science, published recently by Course Technology, Inc.
‡Author to whom correspondence should be addressed.
§Patrick G. McKeown is a Professor in the Department of Management at the University of Georgia. He received his Ph.D. from the University of North Carolina at Chapel Hill and his M.S. and B.S. degrees from the Georgia Institute of Technology. Dr McKeown's primary research interests are in the areas of linear and integer programming, and algorithm development. He has authored numerous books in the areas of management science, computer programming, and information systems and technology. His research has appeared in several journals including Operations Research, Management Science, Naval Research Logistics Quarterly, Computers & Operations Research, and the SIAM Journal on Scientific and Statistical Computing.
Given the enormous amounts of data held in computerized form and the seemingly unavoidable errors that enter these data bases, several researchers have sought automated techniques for identifying and correcting these errors so as to improve the integrity and quality of decisions made on the basis of this information [2-8]. An excellent review of the work in this area is given in [9]. The task of identifying records containing errors and the specific fields causing these errors is known as the data editing problem.

To introduce this problem, suppose x is an (n × 1) vector representing a record of data where xⱼ represents the entry in field j. Further, suppose it is possible to identify an (m × n) matrix A and an (m × 1) vector b such that for any record not containing an error we have Ax ≤ b. Let Ω = {x | Ax ≤ b} denote the set of all records which "pass" these edits. Then for any "failing" record x⁰ ∉ Ω, the following model can be used to identify the specific fields in x⁰ which require changes in order to produce a "correct" (or at least passing) record:

(MWFIC)
Min   f′δ(t)   (1)
s.t.  A(x⁰ + t) ≤ b   (2)

where

δⱼ(t) = 1, if tⱼ ≠ 0;  0, otherwise.   (3)
This is known as the Minimum Weighted Fields to Impute problem for Continuous data (MWFIC), as it is assumed that the values for the data in x⁰ fall within some continuous interval. The (n × 1) vector t in (2) is composed of the elements tⱼ that represent the amounts of change required in the fields xⱼ⁰ in order to produce a "passing" record. The objective in (1) is to determine the minimum weighted number of fields which must be changed in order to produce a "passing" record. The weights given by the (n × 1) vector f in (1) represent the relative confidence the analyst has in the accuracy of the components of x⁰.

2. PREVIOUS WORK

It has been shown that MWFIC is just a special case of the Linear Fixed Charge Problem (LFCP) with no continuous costs [7]. It is well known that this type of problem can be extremely difficult to solve using traditional mixed-integer programming methods [10, 11]. Hence, a specialized algorithm (referred to here as the GKL algorithm) was developed to solve this problem [12]. We will use the following notation in the ensuing discussion of this algorithm. The ith constraint (row) of the A matrix will be denoted by Cᵢ, while its jth column will be referred to as Aⱼ. The element in the ith row and jth column of a matrix such as A will be denoted by Aᵢⱼ. For any record x⁰ ∉ Ω, the matrix A will be divided into two matrices, P and F, which represent the passed and failed edits, respectively. That is, P consists of {Cᵢ | Cᵢx⁰ ≤ bᵢ} and F consists of {Cᵢ | Cᵢx⁰ > bᵢ}. The vector b will be divided in a like manner into bp and bf to correspond with P and F. The GKL algorithm can now be summarized as follows:

STEP 1: (Check the record for errors) If Ax⁰ ≤ b then stop; the record x⁰ does not contain an error. Otherwise, go to STEP 2.

STEP 2: (Generate a candidate solution) Partition A into P and F and solve the following set covering problem (SCP):
Min   f′w   (4)
s.t.  Sw ≥ 1   (5)
      wⱼ ∈ {0, 1},  j = 1, …, n   (6)

where

Sᵢⱼ = 0, if Fᵢⱼ = 0;  1, otherwise.   (7)
In (5), 1 represents an appropriately dimensioned column vector of ones. Denote the optimal solution to (SCP) by w* and let K = {j | wⱼ* = 1}.

STEP 3: (Test the feasibility of the candidate solution from STEP 2) Partition A into [B, B̄] where B is a matrix composed of the columns Aⱼ for j ∈ K and B̄ is composed of the columns Aⱼ for j ∉ K. Similarly, partition x into [x_K, x_K̄]. If Bx_K ≤ b − B̄x⁰_K̄ has a feasible solution x*_K, then stop: K is the set of fields to change and an optimal imputation is (x*_K, x⁰_K̄). Otherwise, go to STEP 4.

STEP 4: (Generate a new constraint which cuts the current candidate solution from SCP) Solve the system
pB = 0,   p(b − B̄x⁰_K̄) ≤ −1,   p ≥ 0   (8)
where p is a (1 × m) vector and 0 is an appropriately dimensioned vector of zeros. Denote the solution to this system by p*, include the new edit given by p*Ax ≤ p*b in A and b, and go to STEP 2.
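To make the mechanics of Steps 1-3 concrete, the following minimal sketch runs one pass of the procedure on the small numeric system that appears in the example of Section 3.1. It is our own illustration rather than the original implementation (the computational work reported later used MPSX/MIP/370): it assumes NumPy and SciPy (version 1.9 or later for milp), and all function names are ours.

```python
# Sketch of one pass through GKL Steps 1-3 on a tiny edit system.
# Step 2's set covering problem and Step 3's LP feasibility test are
# handed to SciPy; function names and data layout are our own choices.
import numpy as np
from scipy.optimize import milp, linprog, LinearConstraint, Bounds

def split_edits(A, b, x0):
    """Partition the edits into passed (P, bp) and failed (F, bf) sets."""
    failed = A @ x0 > b
    return A[~failed], b[~failed], A[failed], b[failed]

def solve_scp(F, f):
    """Step 2: min f'w s.t. Sw >= 1, w binary, with S_ij = 1 iff F_ij != 0."""
    S = (F != 0).astype(float)
    res = milp(c=f,
               constraints=LinearConstraint(S, lb=1, ub=np.inf),
               integrality=np.ones(len(f)),
               bounds=Bounds(0, 1))
    return np.flatnonzero(res.x > 0.5)            # candidate set K (0-indexed)

def feasible_imputation(A, b, x0, K):
    """Step 3: can the fields in K be changed so that A(x0 + t) <= b?"""
    B, Bbar = A[:, K], np.delete(A, K, axis=1)
    rhs = b - Bbar @ np.delete(x0, K)
    # Any feasible point of B xK <= rhs will do, so the LP objective is zero.
    res = linprog(c=np.zeros(len(K)), A_ub=B, b_ub=rhs,
                  bounds=[(None, None)] * len(K))
    return res.success, (res.x if res.success else None)

# Edits (9)-(10) of Section 3.1:  2x1 + x2 <= 5  and  -x1 + x3 <= 1
A  = np.array([[2.0, 1.0, 0.0], [-1.0, 0.0, 1.0]])
b  = np.array([5.0, 1.0])
x0 = np.array([1.0, 4.0, 3.0])
f  = np.array([1.0, 2.0, 3.0])

if np.all(A @ x0 <= b):                            # Step 1
    print("record passes all edits")
else:
    P, bp, F, bf = split_edits(A, b, x0)
    K = solve_scp(F, f)                            # Step 2 -> K = [0] (field 1)
    ok, xK = feasible_imputation(A, b, x0, K)      # Step 3 -> infeasible here
    print("candidate fields", K, "feasible:", ok)
```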
3. REFINEMENTS

While the GKL algorithm seems quite appealing, it was shown to be too computationally expensive for certain types of problems. Specifically, the authors of the GKL algorithm note that it "... is likely to perform well whenever the number of iterations, and hence the number of SCPs that must be solved, is very small. In tightly constrained edit systems, however, it is possible that many implied edits may have to be generated before MWFIC is solved. Also, if there are many fields, and if records typically fail a large number of edits, even a small number of iterations in [the GKL algorithm] may prove computationally expensive. Thus, alternative procedures, even if suboptimal, are needed." [12, p. 926] Thus, this algorithm has largely been abandoned in favor of heuristic solution procedures. However, we have discovered that the efficiency of the GKL algorithm can be greatly improved by the following refinements to Step 2 of the algorithm.
3.1. A modified set covering problem

The idea behind Step 2 of the GKL algorithm is to derive a candidate solution for MWFIC by solving a related set covering problem. Unfortunately, the candidate solution generated by this covering problem may not be feasible in MWFIC, necessitating Steps 3 and 4 of the GKL algorithm. Obviously, one would like to maximize the probability that the candidate solution from Step 2 is feasible in MWFIC. However, the set covering problem suggested in (4)-(7) may frequently lead to easily detected and avoidable infeasible candidate solutions to MWFIC, which create unnecessary iterations through Steps 3 and 4. To see how these infeasible candidate solutions can occur and be avoided, consider the following system of edits:
C₁:  2x₁ + x₂ ≤ 5   (9)
C₂:  −x₁ + x₃ ≤ 1   (10)
Suppose f′ = (1, 2, 3) and x⁰ = (1, 4, 3), so that both edits are failed. Then the set covering problem in Step 2 is given by

Min   w₁ + 2w₂ + 3w₃   (11)
s.t.  w₁ + w₂ ≥ 1   (12)
      w₁ + w₃ ≥ 1   (13)
      w₁, w₂, w₃ ∈ {0, 1}.   (14)
The solution to this problem is w* = (1, 0, 0). Thus, K = {1} and we proceed to Step 3. However, it
is immediately obvious that there is no solution to Bx_K ≤ b − B̄x⁰_K̄, given by

2x₁ ≤ 1   (15)
−x₁ ≤ −2   (16)
since (15) can only be satisfied if x₁ ≤ 0.5 and (16) can only be satisfied if x₁ ≥ 2. Thus, the candidate solution generated in Step 2 is infeasible for the problem in Step 3 and, according to the original GKL algorithm, we must proceed to Step 4.

Let us now stop and consider why the candidate solution identified in Step 2 is infeasible in Step 3. In Step 2 we are trying to identify a minimum weighted subset of the variables in x⁰ which can be changed in a manner which will pass all the edits. This is attempted by solving the set covering problem in Step 2 which, for our example, is given by (11)-(14). The logic behind (12) is that since (9) was failed, some change needs to occur in the values of x₁⁰ and/or x₂⁰ if we intend to correct the record in a way that passes (9). Similarly, (13) indicates that some change needs to occur in the values of x₁⁰ and/or x₃⁰ if we intend to pass (10). The problem with this approach is that it ignores the direction of the changes that need to take place, since (7) ignores the signs of the coefficients Fᵢⱼ. To see this, recall that to find the changes which need to occur in x⁰ in order to pass the failed edits we want to find values for the vector t such that

F(x⁰ + t) ≤ bf   (17a)

or equivalently

−Ft ≥ Fx⁰ − bf   (17b)
Now for the preceding example in (9)-(10) the constraints represented by (17b) are given by

−2t₁ − t₂ ≥ 1   (18)
t₁ − t₃ ≥ 1.   (19)
Since the variables tⱼ represent the amounts by which we must change the values xⱼ⁰, it is clear that if we are only allowed to change x₁⁰ (as was implied by K = {1} in the solution to Step 2 in (11)-(14)), no feasible solution will result, since (18) can only be satisfied if t₁ takes on a negative value (in particular, t₁ ≤ −0.5) and (19) can only be satisfied if t₁ takes on a positive value (in particular, t₁ ≥ 1). Of course, x₁⁰ cannot be simultaneously increased and decreased.

Thus, it would be helpful if the set covering problem solved in Step 2 could be modified to indicate not only which fields need to be changed in order for the record to pass a given edit, but also in what direction each field needs to be changed (i.e. increased or decreased) if it is selected to correct the record. To see how this might be accomplished, recall that any unrestricted variable tⱼ can be expressed as the difference of two non-negative variables tⱼ⁺ and tⱼ⁻ (i.e. tⱼ = tⱼ⁺ − tⱼ⁻, where tⱼ⁺, tⱼ⁻ ≥ 0). Thus, (17b) may be written as

−F(t⁺ − t⁻) ≥ Fx⁰ − bf   (20a)

or equivalently

−Ft⁺ + Ft⁻ ≥ Fx⁰ − bf   (20b)

where t⁺, t⁻ ≥ 0. Thus, tⱼ⁺ and tⱼ⁻ represent the amounts by which we propose to increase or decrease the value of xⱼ⁰, respectively. This use of non-negative deviational variables in the data editing problem has also been used by others [7]. For our example, the constraints of (20b) are given by

−2t₁⁺ − t₂⁺ + 2t₁⁻ + t₂⁻ ≥ 1   (21)
t₁⁺ − t₃⁺ − t₁⁻ + t₃⁻ ≥ 1.   (22)
Since the right hand side (RHS) of the inequality in (21) is greater than zero, it is obvious that if (21) is to be satisfied at least one of the variables t₁⁻ and t₂⁻ (i.e. the variables with positive coefficients) must assume a positive value. Similarly, since the RHS of (22) is greater than zero, if (22) is to be satisfied, at least one of the variables t₁⁺ and t₃⁻ (i.e. the variables with positive coefficients) must assume a positive value.
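To make this sign rule concrete, the short sketch below (our own illustration, assuming NumPy) reads the covering requirements off the failed edits of the example. The names S_plus and S_minus anticipate the matrices S⁺ and S⁻ defined formally in (32)-(33) further on.

```python
# For each failed row, a field j can help restore feasibility either by
# increasing (coefficient -F_ij > 0) or by decreasing (coefficient F_ij > 0).
import numpy as np

F  = np.array([[2.0, 1.0, 0.0], [-1.0, 0.0, 1.0]])   # failed edits (9)-(10)
bf = np.array([5.0, 1.0])
x0 = np.array([1.0, 4.0, 3.0])
assert np.all(F @ x0 - bf > 0)    # failed edits: every RHS of (20b) is positive

S_plus  = (-F > 0).astype(int)    # w_j+ (increase x_j) can cover the row
S_minus = ( F > 0).astype(int)    # w_j- (decrease x_j) can cover the row

for i in range(F.shape[0]):
    up   = [f"w{j+1}+" for j in np.flatnonzero(S_plus[i])]
    down = [f"w{j+1}-" for j in np.flatnonzero(S_minus[i])]
    print(" + ".join(up + down), ">= 1")
# row 1 -> w1- + w2- >= 1 : decrease x1 and/or x2   (cf. (24) below)
# row 2 -> w1+ + w3- >= 1 : increase x1 and/or decrease x3   (cf. (25) below)
```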
To generalize this idea, it is easy to show that for any data editing problem all RHS values in (20b) will be greater than zero, since by definition Fx⁰ > bf. Thus, if there is a feasible solution to the data editing problem, each constraint in (20b) must have at least one positive coefficient. (This is easily proved, for if each row of F is non-null (which it must be if an edit exists) then each row of [−F, F] must have a positive element.) So in general we can say that, in each constraint of (20b), at least one of the variables tⱼ⁺ or tⱼ⁻ with a positive coefficient must assume a positive value if (20b) is to be satisfied [since t⁺ = t⁻ = 0 is not a solution to (20b)].

Returning now to our example, we are motivated to suggest the following modified set covering problem to be used in identifying a candidate solution in Step 2:
Min   w₁⁺ + w₁⁻ + 2w₂ + 3w₃   (23)
s.t.  w₁⁻ + w₂ ≥ 1   (24)
      w₁⁺ + w₃ ≥ 1   (25)
      w₁⁺ + w₁⁻ ≤ 1   (26)
      wⱼ⁺, wⱼ⁻ ∈ {0, 1}, ∀j.   (27)
In (24) we are indicating that any solution of (21) must involve at least one of the variables t₁⁻ and t₂⁻. In (25) we indicate that any solution of (22) must involve at least one of the variables t₁⁺ and t₃⁻. Recall that tⱼ⁺ and tⱼ⁻ represent the amounts by which we propose to increase or decrease the value of xⱼ⁰, respectively. Since x₁⁰ cannot be increased and decreased simultaneously, (26) limits us to obtaining, at most, one of these alternatives. Note that without this last restriction in (26), the optimal solution to the problem would be w₁⁺ = w₁⁻ = 1 (i.e. it would indicate that field 1 should be simultaneously increased and decreased), which is equivalent to the infeasible solution identified earlier using (SCP) in (11)-(14).

To generalize this idea, we propose to solve the following modified SCP in Step 2 of the GKL algorithm:

(MSCP)
Min   f′w⁺ + f′w⁻   (28)
s.t.  S⁺w⁺ + S⁻w⁻ ≥ 1   (29)
      w⁺ + w⁻ ≤ 1   (30)
      wⱼ⁺, wⱼ⁻ ∈ {0, 1}, ∀j   (31)

where

S⁺ᵢⱼ = 1, if −Fᵢⱼ > 0;  0, otherwise   (32)

and

S⁻ᵢⱼ = 1, if Fᵢⱼ > 0;  0, otherwise.   (33)
Using this formulation, the definition of the candidate solution K for the feasibility test in Step 3 changes to K = {j | wⱼ⁺ = 1 or wⱼ⁻ = 1 in the solution to (MSCP)}. So by using (MSCP) in place of (SCP) we have a model that is more likely to produce a feasible solution to MWFIC. However, this model is not a pure set covering problem and therefore cannot be solved using specialized set covering algorithms. Also, (MSCP) has twice as many variables and many more constraints than (SCP). Thus, by using (MSCP) there will clearly be a trade-off between the quality of the candidate solutions and the ease with which they are generated. Our computational results, discussed later, suggest this trade-off is worthwhile.
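As an illustration of how (MSCP) might be assembled and solved with off-the-shelf tools (the computational study reported later used MPSX/MIP/370, not this code), here is a minimal sketch assuming NumPy and SciPy 1.9 or later; the function and variable names are ours.

```python
# Sketch of the modified set covering problem (MSCP), (28)-(31).
# The decision vector stacks w+ on top of w-.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def solve_mscp(F, f):
    m, n = F.shape
    S_plus, S_minus = (-F > 0).astype(float), (F > 0).astype(float)

    cover   = LinearConstraint(np.hstack([S_plus, S_minus]),        # (29)
                               lb=1.0, ub=np.inf)
    one_dir = LinearConstraint(np.hstack([np.eye(n), np.eye(n)]),   # (30)
                               lb=-np.inf, ub=1.0)

    res = milp(c=np.concatenate([f, f]),                            # (28)
               constraints=[cover, one_dir],
               integrality=np.ones(2 * n),                          # (31)
               bounds=Bounds(0, 1))
    w_plus, w_minus = res.x[:n], res.x[n:]
    K = np.flatnonzero((w_plus > 0.5) | (w_minus > 0.5))
    return K, w_plus.round(), w_minus.round()

# Failed edits (9)-(10) with f = (1, 2, 3):
F = np.array([[2.0, 1.0, 0.0], [-1.0, 0.0, 1.0]])
f = np.array([1.0, 2.0, 3.0])
K, wp, wm = solve_mscp(F, f)
print(K, wp, wm)   # K = [0 1]: increase field 1, decrease field 2, as in (23)-(27)
```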
3.2. Combining passed and failed edits

Another potential cause of the poor results previously obtained with the GKL algorithm is that when solving (SCP) [or alternatively (MSCP)] in Step 2 we are only using the
information provided by the failed edits (F). Yet when determining the changes which need to occur in x⁰ in order to pass the failed edits, it is equally important not to cause a violation of any of the passed edits (P). That is, when determining t in (17a) we must also consider

P(x⁰ + t) ≤ bp   (34a)

or equivalently

Pt⁺ − Pt⁻ ≤ bp − Px⁰.   (34b)
Unfortunately, the general method described above for generating set covering constraints from (20b) does not apply to (34b), since t⁺ = t⁻ = 0 is a solution to (34b). However, it may be possible to form linear combinations of the constraints in (20b) and (34b) from which additional non-redundant set covering constraints can be generated [4]. These new set covering constraints can then be included in the problem solved in Step 2 to provide important information about the passed edits (P).

As a simple example, suppose we have a data base with two fields and one failed edit for which F = [1, −1] and Fx⁰ − bf = 5. Thus, (20b) is represented by

−t₁⁺ + t₂⁺ + t₁⁻ − t₂⁻ ≥ 5   (35)
which by (29) generates the set covering constraint

w₂⁺ + w₁⁻ ≥ 1.   (36)
Now assume there is also a single passed edit for which P = [1, 1] and bp − Px⁰ = 2. Thus, (34b) is represented by

t₁⁺ + t₂⁺ − t₁⁻ − t₂⁻ ≤ 2.   (37)
Note that (37) cannot be used to generate a set covering constraint, since t⁺ = t⁻ = 0 is a feasible solution for this inequality. However, if we multiply (37) by negative one and add it to (35) we obtain the redundant constraint given by

−2t₁⁺ + 2t₁⁻ ≥ 3   (38)
which generates the following non-redundant set covering constraint

w₁⁻ ≥ 1.   (39)
Thus, carefully selected linear combinations of the constraints in (20b) and (34b) may be used to produce additional set covering constraints that may improve the formulation of (SCP) and/or (MSCP). In general, these linear combinations can easily be made from any passed edit i, failed edit k and field j where the signs of the non-zero coefficients Pᵢⱼ and −Fₖⱼ are the same and

(−Fₖⱼ/Pᵢⱼ)(bpᵢ − Pᵢx⁰) < Fₖx⁰ − bfₖ.   (40)
Here Pᵢ and Fₖ refer to the ith row of P and the kth row of F, respectively. The passed edit i will be multiplied by −Fₖⱼ/Pᵢⱼ and subtracted from the failed edit k, just as in the preceding example. The condition in (40) ensures that the r.h.s. value of the resulting redundant "greater than" constraint will be strictly positive which, in turn, allows a set covering constraint to be generated. The redundant constraint in terms of t is given by

(−Fₖ − (−Fₖⱼ/Pᵢⱼ)Pᵢ)(t⁺ − t⁻) ≥ (Fₖx⁰ − bfₖ) − (−Fₖⱼ/Pᵢⱼ)(bpᵢ − Pᵢx⁰).   (41)
The coefficients for tⱼ⁺ and tⱼ⁻ in (41) will be zero and the r.h.s. of this constraint will be greater than zero [by (40)]. Thus, the set covering constraint generated by (41) will not involve wⱼ⁺ or wⱼ⁻ and may therefore help to improve the formulation of the problem used in Step 2 of the GKL algorithm.

In the preceding example, it is interesting to note that the constraint in (39) actually dominates the original set covering constraint (36). That is, if both constraints are included in (MSCP), (36) is redundant since it is always satisfied when (39) is satisfied and, indeed, (39) will always be satisfied
via primal feasibility. Thus, once the collection of all possible set covering constraints for a given record x⁰ has been identified, these should be reduced to a "best set" of non-dominated constraints. Procedures for such reductions are discussed in [13].
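The following sketch illustrates the constraint-generation idea of this section on the two-field example above. It is our own illustration, not the SETCUT preprocessor used in the computational study, it assumes NumPy, and it omits the reduction to a non-dominated "best set".

```python
# Combine a passed edit i and a failed edit k on a field j (same signs of
# P_ij and -F_kj, plus condition (40)) to obtain a redundant ">=" row with a
# positive right-hand side, from which an extra covering row can be read off.
import numpy as np

def extra_cover_rows(P, bp, F, bf, x0):
    rows = []
    slack_p = bp - P @ x0            # >= 0 for passed edits
    slack_f = F @ x0 - bf            # >  0 for failed edits
    for k in range(F.shape[0]):
        for i in range(P.shape[0]):
            for j in range(P.shape[1]):
                if P[i, j] == 0 or F[k, j] == 0:
                    continue
                lam = -F[k, j] / P[i, j]
                if lam <= 0:                        # signs must agree
                    continue
                if lam * slack_p[i] >= slack_f[k]:  # condition (40)
                    continue
                coeff = -F[k] + (F[k, j] / P[i, j]) * P[i]   # as in (41)
                rhs   = slack_f[k] - lam * slack_p[i]        # > 0 by (40)
                # covering row: w_l+ where coeff_l > 0, w_l- where coeff_l < 0
                rows.append(((coeff > 0).astype(int),
                             (coeff < 0).astype(int), rhs))
    return rows

# Example from this section: one passed and one failed edit on two fields,
# with F x0 - bf = 5 and bp - P x0 = 2 (take x0 = 0 for simplicity).
P, bp = np.array([[1.0, 1.0]]), np.array([2.0])
F, bf = np.array([[1.0, -1.0]]), np.array([-5.0])
x0 = np.zeros(2)
print(extra_cover_rows(P, bp, F, bf, x0))
# -> one extra row with w- indicator [1, 0], i.e. w1- >= 1 as in (39)
```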
4. METHODOLOGY

As mentioned previously, the MWFIC problem can be formulated as an LFCP. Thus, the following procedures for solving MWFIC will be used for computational testing purposes:

(1) MWFI: The fixed-charge (MWFI) formulation given in [7], including the set covering constraints defined above in (29) (and discussed in [10]).
(2) PMWFI: A preprocessed MWFI formulation with additional non-dominated set covering constraints generated from redundant constraints as discussed above in Section 3.2.
(3) GKL: The standard GKL algorithm.
(4) PGKL: Same as GKL, with the preprocessor to generate additional non-dominated set covering constraints run first.
(5) MGKL: The standard GKL algorithm using the alternate (MSCP) problem in place of (SCP) in Step 2 of the algorithm.
(6) PMGKL: Same as MGKL, with the preprocessor to generate additional non-dominated set covering constraints run first.

If the refinements proposed above in Section 3 are of any value, it is incumbent on us to show that they offer some advantage over the original GKL algorithm. One way to do this is to show that these refinements perform well in situations where the original GKL algorithm performs poorly. A knowledge-based optimization system could then be employed to select either the original or refined version of the GKL algorithm depending on the characteristics of the problem to be solved [14-17].

Earlier, we noted that the authors of the GKL algorithm found that it is likely to perform poorly "if there are many fields, and if records typically fail a large number of edits." So we shall conduct our testing under these specific data conditions. The ratio of the number of fields to edits for all problems tested in [12] was approximately 0.96. The test problems we solve here have fifty fields and twenty edits. Thus, these problems have "many fields" relative to the number of edits (e.g. 50/20 = 2.5 vs 0.96). These problems were randomly generated with the following characteristics:
∑ⱼ |Aᵢⱼ| ≤ 20,   0 ≤ …
Fig. 1. Comparison of solution methods' average run times (CPU seconds, including I/O, by solution procedure and percentage of failed edits; not drawn to scale).
5. COMPUTATIONAL RESULTS
A graphical display of the average run times for the various combinations of failed edits and solution methods is given in Fig. 1. The numerical details underlying this graph and the ANOVA results are given in Table 1.

Table 1. Comparison of solution methods

                              Run times in CPU secondsᵃ (including I/O)
% of failed edits              MWFI    PMWFI      GKL     PGKL     MGKL    PMGKL
 5    Best                     0.80     0.82     0.79     0.80     0.88     0.88
      Average                  0.87     0.86     0.84     0.82     0.95     0.91
      Worst                    0.96     0.90     0.96     0.84     1.17     0.93
25    Best                     0.98     1.56     1.02     0.92     0.91     1.00
      Average                 19.92     8.02     9.54     5.32     1.91     1.89
      Worst                   80.92    25.34    39.13    11.76     4.01     3.60
50    Best                     1.27     1.90     0.94     2.27     0.97     1.13
      Average                 10.52     3.76    28.17    11.84     2.25     1.25
      Worst                   35.08     6.06    95.23    38.60     5.83     1.49
75    Best                     2.33     2.68     6.22    10.92     1.04     1.15
      Average                 17.33    11.70   127.28    87.85     2.69     1.53
      Worst                   40.95    27.59   441.89   237.62     7.14     2.16
95    Best                     2.19     2.67    87.94    53.29     0.96     1.07
      Average                  9.98     7.74   193.64   161.45     1.26     1.30
      Worst                   34.82    20.86   321.62   362.24     2.07     1.91

ᵃTimes shown for PMWFI, PGKL and PMGKL include time taken by SETCUT.

ANOVA results

Source                 D.F.    Sum of squares    F-value    P-value of F
Intercept                 1         274.1652      316.14          0.0001
Factor A                  4         116.8155       33.67          0.0001
Factor B                  5         107.5028       24.79          0.0001
A, B interaction         20          89.3220        5.15          0.0001
C nested within A        20          22.5528        1.30          0.1968
Model                    50         610.3584       14.08          0.0001
Error                   100          86.0816

Index of factors: A: % of edits failed; B: solution method; C: subjects (or problems). MSE = 0.86723.
The run times listed in Table 1 for PMWFI, PGKL and PMGKL include the amount of time it took to generate additional non-dominated set covering constraints with the SETCUT preprocessor. In all cases the SETCUT processing time amounted to less than 0.03 CPU seconds.

From the ANOVA results in Table 1 we note that there are significant interaction effects occurring between the "percentage of failed edits" and "solution method" factors. In Fig. 1 it is clear that these interactions are of a very complicated form. However, a number of important observations are suggested by Fig. 1. First, it is clear that none of the solution methods had much difficulty solving the problems when 5% of the edits were failed. All six methods were able to solve these problems (on average) in less than one CPU second. However, if more than 5% of the edits are failed, this situation changes drastically. In this case, the MWFI and PMWFI methods continue to solve the problems (on average) in a moderate, but longer than desirable, amount of CPU time (usually less than 20 CPU seconds). From Table 1, the worst case performance using the MWFI method was a problem with 25% failed edits which took 80.92 s. Overall, the run times using these methods were fairly stable (on average) when more than 5% of the edits were failed, varying between 10 and 20 s for MWFI and from about 4 to 12 s for PMWFI.

The results for the GKL and PGKL methods when more than 5% of the edits are failed are not as encouraging. As reported by GKL, there seems to be a strong relationship between the number of edits failed and the time required to solve the problem. The problems with more than 50% failed edits took (on average) more than one CPU minute, with a worst case problem taking over 7 CPU minutes.

The best results by far were obtained using the MGKL and PMGKL methods. The average run time using either of these methods was less than 3 CPU seconds, regardless of the number of failed edits. In the worst cases, the MGKL and PMGKL methods took 7.14 and 3.60 s, respectively. This represents roughly a 1100% improvement over the worst case performance using the MWFI method and more than a 6100% improvement over GKL.

The relative overall performance of the various solution methods may be formally assessed by a comparison of their factor level means. Specifically, interest would center on nine different comparisons: three pairwise comparisons among the MWFI, GKL and MGKL methods, three pairwise comparisons among their preprocessed counterparts (PMWFI, PGKL and PMGKL, respectively), and pairwise comparisons between MWFI and PMWFI, GKL and PGKL, and finally MGKL and PMGKL. These last three comparisons will reveal if the SETCUT preprocessor makes a significant difference in the average run times. These nine pairwise comparisons were made (using a Bonferroni multiple comparison procedure with α = 0.05) and indicate significant differences in all of the first six comparisons and no significant differences in the last three (with a 95% family confidence coefficient). That is, the average run time using MGKL is significantly different from (faster than) both MWFI and GKL, with MWFI also being significantly different from (faster than) GKL. The same relationships hold among the preprocessed counterparts of these formulations. However, the SETCUT preprocessor used with PMWFI, PGKL and PMGKL made no significant difference in the average run times when compared to MWFI, GKL and MGKL, respectively.
It might be argued that the performance of the GKL methods would have been better had a specialized algorithm been used to solve the set covering problems (SCP) rather than the standard branch-and-bound techniques offered by MPSX/MIP/370. Greater insight into the absolute advantage of the MGKL methods over GKL can be obtained from the information in Table 2.
Table 2. Comparison of MGKL and GKL average number of cuts

% of failed edits      GKL     PGKL     MGKL    PMGKL    SETCUT
        5             0.20     0.00     0.20     0.00      5.80
       25            11.40     8.20     3.00     1.60     22.00
       50            20.40     4.80     3.40     0.00     43.00
       75            31.40    22.20     2.20     0.00     32.80
       95            60.20    40.00     0.20     0.00     12.00
In Table 2 we show the average number of iterations, or cuts, required by each of the GKL and MGKL methods. Of the 25 problems tested, only two required any cuts at all using PMGKL. That is, the optimal solutions (defined by K) to 23 of the problems were found by solving the initial (MSCP) problem with the additional non-dominated set covering constraints added. For the problems with 95% failed edits, an average of 60 iterations would be required using the GKL procedure [with (SCP)], regardless of the optimization procedure used to solve the set covering problems. It is doubtful that even the fastest set covering algorithms would solve these problems quickly enough to out-perform PMGKL.

The column labeled "SETCUT" in Table 2 gives the average number of additional non-dominated set covering constraints that were generated for each type of problem. For instance, when only 5% of the edits were failed, an average of 5.8 additional cuts could be made from redundant constraints generated by combining passed and failed edits. This number increases to a maximum when 50% of the edits are failed and then begins to decrease. This is reasonable, for if 100% of the edits were failed there would be no passed edits from which to generate additional redundant constraints. It is interesting to note from Table 1 that generating these additional constraints (used in PMWFI, PGKL and PMGKL) reduced the average run times required to solve the problems in almost all cases. However, as shown earlier, these differences are not statistically significant.

6. CONCLUSIONS
In this paper we have presented a refined approach for solving the minimum weighted fields to impute problem for continuous data. While this algorithm was previously found to have prohibitive computational requirements, the refinements and modifications we suggest appear to drastically improve its performance on a class of "worst case" problems. The major factor behind this improvement is the use of a modified set covering problem which generates candidate solutions that have a higher probability of producing optimal solutions for the data editing problem.

These results are significant as they offer hope for the attainment of globally optimal solutions for this class of particularly difficult data editing problems, rather than heuristic solutions of uncertain quality. Thus, the technique presented here, which is effective on problems involving many fields where a record fails a large number of edits, could be combined with the original GKL algorithm in a knowledge-based framework to create a system that is capable of solving a variety of problems in an efficient manner. Future work in this area could explore the applicability of these procedures to the case of mixed data editing problems involving both continuous and discrete data types [8].

REFERENCES

1. Associated Press, Court debates computer goofs. Roanoke Times & World News, December 8, A20 (1994).
2. R. J. Freund and H. O. Hartley, A procedure for automatic edit and imputation. J. Am. Statist. Assoc. 63, 341-352 (1967).
3. J. I. Naus, T. G. Johnson and R. Montalvo, A probabilistic model for identifying errors in data editing. J. Am. Statist. Assoc. 67, 943-950 (1972).
4. I. P. Fellegi and D. Holt, A systematic approach to automatic edit and imputation. J. Am. Statist. Assoc. 71, 17-35 (1976).
5. B. Greenberg, Developing an edit system for industry statistics. Computer Science and Statistics: Proc. 13th Symp. on the Interface, Pittsburgh, Pa (Edited by W. F. Eddy), pp. 11-16. Springer, Berlin (1981).
6. G. E. Liepins, Can automatic data editing be justified? One person's opinion. Statistical Methods and Improvement of Quality, pp. 205-213. Academic Press, New York (1983).
7. P. G. McKeown, A mathematical programming approach to editing of continuous survey data. SIAM J. Sci. Stat. Comput. 5, 784-797 (1984).
8. J. R. Schaffer, A procedure for solving the data editing problem with both continuous and discrete data types. Naval Res. Logistics Q. 34, 879-890 (1987).
9. M. Pierzchala, A review of the state of the art in automated data editing and imputation. J. Official Stat. 6, 355-377 (1990).
10. P. G. McKeown and C. T. Ragsdale, A computational study of using preprocessing and stronger formulations to solve large general fixed-charge problems. Computers Op. Res. 17, 9-16 (1990).
11. D. Wright and C. Haehling von Lanzenauer, Solving the fixed-charge problem with Lagrangian relaxation and cost allocation heuristics. Eur. J. Op. Res. 42, 304-312 (1989).
12. R. S. Garfinkel, A. S. Kunnathur and G. E. Liepins, Error localization for erroneous data: continuous data, linear constraints. SIAM J. Sci. Stat. Comput. 9, 922-931 (1988).
13. G. L. Nemhauser and L. A. Wolsey, Integer and Combinatorial Optimization. Wiley, New York (1988).
14. K. R. MacLeod and G. R. Reeves, AXIS: A framework for combining structured algorithms and knowledge-based systems. Computers Op. Res. 20, 613-623 (1993).
15. K. R. MacLeod and G. R. Reeves, An application of the AXIS solution framework to multiple objective aggregate production planning. Decision Sci. 23, 1315-1332 (1992).
16. A. G. Greenwood, L. P. Rees and I. W. Crouch, Separating the art and science of simulation optimization: a knowledge-based architecture providing for machine learning. IIE Trans. 25, 70-83 (1993).
17. I. W. Crouch, A. G. Greenwood and L. P. Rees, Separating the art and science of simulation optimization: a knowledge-based architecture providing for machine learning. Naval Res. Logist. In press.
18. J. Neter, W. Wasserman and M. H. Kutner, Applied Linear Statistical Models, 2nd edn. Richard D. Irwin, Homewood, Illinois (1985).
19. International Business Machines Corporation, IBM Mathematical Programming System Extended/370 (MPSX/370), Mixed Integer Programming/370 (MIP/370). White Plains, New York (1975).