Information Systems Vol. 15, No. 5, pp. 537-542, 1990. Printed in Great Britain. All rights reserved.
0306-4379/90 $3.00 + 0.00. Copyright © 1990 Pergamon Press plc.
SOME EXPERIMENTS IN THE USE OF CLUSTERING FOR DATA VALIDATION†

W. F. STORER¹ and C. M. EASTMAN²‡

¹Department 94D, IBM/Information Network, 3405 West Buffalo Ave, Tampa, FL 33630, U.S.A.
²Department of Computer Science, University of South Carolina, Columbia, SC 29208, U.S.A.

(Received 25 October 1989; in revised form 1 February 1990; received for publication 19 June 1990)

†The research described here was performed while the authors were with the Department of Mathematics and Computer Science, Florida State University, Tallahassee, Fla.
‡Author for correspondence.
Abstract - Previous work has demonstrated the feasibility of using clustering to detect errors in small databases. Singleton clusters, containing only one record, often represent errors. Use of a new clustering algorithm oriented to this application provided improvements in time complexity without degradation of the error detection performance in a larger collection of data.

Key words: Clustering, data errors, data validation
INTRODUCTION
The existence of data errors in information systems of all kinds can present serious problems, and a wide range of techniques has been developed to handle such errors. The focus of such techniques may be on prevention, detection or correction. Data entry procedures are designed to reduce the probability of input errors. Edit programs of varying degrees of sophistication are used to screen the data for errors. Attention is paid to system design to ensure that the system does not corrupt the data. Auditing procedures examine the system activities for problems, including the quality of the data. Overviews of these approaches are presented in a number of introductory books, including those by Cardenas [1], Loomis [2], Salzberg [3] and Wiederhold [4].

Our interest here is in the process of data editing, in which data records are examined in some manner for potential errors. Simple approaches to this problem include the use of range checks and intrarecord consistency checks. A number of more sophisticated statistically based approaches have been developed. Some of these are described in Felligi and Holt [5], Freund and Hartley [6], Naus [7], Naus et al. [8] and Wright [9]. Many of these approaches are designed for systems that handle statistical data. Some of the problems of concern in this context, such as the validity of statistical inferences, do not arise in other information systems. However, errors are still a problem. One approach that has been developed uses clustering to flag unusual records as possible errors [10, 11]; such unusual records are often termed
outliers. This approach is explored in this note. A new clustering algorithm designed for use in this application is described. Some experiments examining the performance of this algorithm and two others in finding errors in a file of road intersection data are then described. This work is based upon Storer [12]; a summary is presented in Storer and Eastman [13].

CLUSTERING AND OUTLIERS
The area of clustering analysis has been active for many years. Problems of attribute choice and measurement, measures of similarity and dissimilarity, and clustering methods and algorithms have been extensively studied, and the approaches developed have been applied to problems across a broad range of disciplines. Extended treatments of clustering include those by Anderberg [10], Duran and Odell [14], Everitt [15], Hartigan [16], Romesburg [17] and Sneath and Sokal [18].

It is hard to define an outlier precisely, but it can be informally regarded as a data point which is sufficiently different from other data points to be regarded as unusual. Figure 1 illustrates the relationship of outliers to clusters. An outlier may result from an error, but this is not necessarily the case. The problem of handling outliers in statistical data has been extensively studied; Barnett and Lewis [19] provide a comprehensive overview of this area. Approaches to this problem include the development of statistical procedures which are robust in the presence of outliers and the development of methods to detect outliers. Most attention has been given to the univariate case, but some methods have been developed for the multivariate case.

Outliers are widely recognized as a potential problem in the application of clustering algorithms, and some clustering algorithms are especially sensitive to
Fig. 1. Clusters and outliers.

Fig. 2. The short spanning path algorithm.
the presence of outliers. It is often recommended that outliers be examined and possibly removed before clustering is performed [10, 15, 17]. Some effort has been directed to the problem of identifying unusual records in the context of clustering. Hall [20] and Goodall [21] describe measures which can be used to identify unusual objects. Choice of the cluster member most dissimilar to its cluster center to start a new cluster is used in some clustering algorithms [22].

Our interest here is in the use of clustering to detect data errors, rather than in the impact of outliers on cluster structure. Anderberg [10] comments that outliers may show up in clusters of one or two items; he suggests that a variation of the k-means algorithm [23] might be useful in outlier detection. This algorithm is one of those based upon an iterative partitioning of the data. This approach involves the selection of k seed points as initial cluster centers. The number of clusters may be adjusted by merging clusters whose centroids are separated by less than a specified distance C and starting new clusters with points greater than a distance R from the nearest centroid. Anderberg suggests selecting a larger R and a smaller C than usual to use this algorithm in outlier detection. Anderberg also comments on outlier detection using the clustering algorithms proposed by Ball and Hall [24] and Wishart [25].
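As an illustration of this kind of thresholded partitioning, the sketch below performs a single assignment pass with the two thresholds; the parameter names, the Euclidean distance and the choice of seed points are illustrative assumptions, not details from Anderberg [10] or MacQueen [23].

```python
import math

def distance(a, b):
    # Euclidean distance between two numeric records (illustrative choice).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def thresholded_partition(records, k, R, C):
    """One assignment pass of a k-means-style partition with the two thresholds
    described above: records farther than R from every centre seed new clusters,
    and centres closer together than C are merged."""
    centres = [list(r) for r in records[:k]]        # arbitrary initial seed points

    # Seed a new cluster for any record farther than R from all current centres.
    for r in records:
        if all(distance(r, c) > R for c in centres):
            centres.append(list(r))

    # Merge centres that are separated by less than C.
    kept = []
    for c in centres:
        if all(distance(c, m) >= C for m in kept):
            kept.append(c)
    centres = kept

    # Assign every record to its nearest remaining centre.
    clusters = [[] for _ in centres]
    for r in records:
        nearest = min(range(len(centres)), key=lambda i: distance(r, centres[i]))
        clusters[nearest].append(r)

    # Clusters of one or two records are the candidates inspected as outliers.
    suspects = [c for c in clusters if len(c) <= 2]
    return clusters, suspects
```

With a larger R, only records far from every established centre seed new clusters, and a smaller C keeps such small clusters from being merged away; clusters of one or two records are then the candidates inspected as possible outliers.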
SOME PREVIOUS RESULTS
Lee et al. [11] use clustering to find data exceptions in small databases when only minimal information concerning the nature of the data is provided. Two sets of data are used for experimental purposes. The first set of data consisted of 20 records from the iris data. Each record represented an iris and contained four fields used to describe the iris. One of the fields in a record was deliberately set to an incorrect value. This error was then discovered using clustering. The second set of data consisted of 50 records containing information about warships. Each record represented a warship and contained eight fields that were used to describe the warship's characteristics.
Clustering analysis flagged four records in this data; two contained previously undetected errors.

The clustering algorithm used is a variation of the short spanning path algorithm described by Slagle et al. [26]. In this algorithm a short spanning path is constructed as an approximation to the shortest path. (Determination of the shortest spanning path is an NP-complete problem.) Links longer than a specified length are regarded as inconsistent; they are broken to form clusters. A diagram illustrating this algorithm is shown in Fig. 2. This algorithm has time complexity O(N²M + M²N), where N is the number of records and M is the number of fields. Other discussions of short spanning path clustering algorithms can be found in Bhat and Haupt [27], McCormick et al. [28] and Slagle et al. [29].

The distance measure used handles records containing both coded and numeric fields. Each of the N records $R_i$ contains M fields, $x_{i1}, x_{i2}, \ldots, x_{iM}$. The distance between two records $R_i$ and $R_j$ is calculated as

$$d_{ij} = \sum_{k=1}^{M} w_k\, C(x_{ik}, x_{jk}),$$

where

$$C(x_{ik}, x_{jk}) = \delta(x_{ik}, x_{jk})$$

for a coded field and

$$C(x_{ik}, x_{jk}) = \lvert x_{ik} - x_{jk} \rvert$$

for a numeric field. The weighting factors, $w_k$, are computed as

$$w_k = \frac{N(N-1)/2}{\displaystyle\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} C(x_{ik}, x_{jk})},$$

where N is the number of records.
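For concreteness, the following sketch computes this distance measure directly from the definitions above. The record layout (tuples of field values), the set of numeric field indices and the convention that $\delta(x_{ik}, x_{jk})$ is 1 when two coded values differ and 0 when they match are assumptions of the sketch, not details taken from Lee et al. [11].

```python
def field_difference(a, b, numeric):
    # C(x_ik, x_jk): absolute difference for a numeric field; for a coded field,
    # delta is taken here to be 1 when the values differ and 0 when they match.
    return abs(a - b) if numeric else (0.0 if a == b else 1.0)

def weighting_factors(records, numeric_fields):
    """w_k = [N(N-1)/2] / (sum of C(x_ik, x_jk) over all record pairs i < j)."""
    n, m = len(records), len(records[0])
    pairs = n * (n - 1) / 2
    weights = []
    for k in range(m):
        total = sum(field_difference(records[i][k], records[j][k], k in numeric_fields)
                    for i in range(n - 1) for j in range(i + 1, n))
        weights.append(pairs / total if total else 0.0)
    return weights

def record_distance(r_i, r_j, weights, numeric_fields):
    """d_ij = sum over fields k of w_k * C(x_ik, x_jk)."""
    return sum(w * field_difference(r_i[k], r_j[k], k in numeric_fields)
               for k, w in enumerate(weights))
```

Computing the $w_k$ exactly in this way requires all $N(N-1)/2$ record pairs for each field; this $O(N^2)$ cost is the motivation for the estimated weighting factors discussed under the experimental procedures below.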
SOME OPEN QUESTIONS

Previous work has clearly established the potential utility of detecting data errors through the use of clustering. However, there are certainly some open questions. What clustering algorithms are most effective in this application? How generalizable are the approaches to a broad range of data? How well does this approach scale up to larger databases? To what extent are approaches developed for statistical databases appropriate for other kinds of databases? In this note we consider partial and necessarily incomplete answers to these questions. A somewhat larger database is used. Three clustering algorithms, including one developed for this purpose, are compared. Commonly used approaches to improving the efficiency of clustering algorithms, such as estimating weighting factors and breaking the data into smaller chunks, are also used.

There are also some open questions of broader scope. The effectiveness and cost of this approach can be compared to those associated with alternative approaches to data editing. And the potential role of such approaches in a coordinated operational plan for error control can be considered. These broader questions are not addressed here.
GREATEST DISTANCE ALGORITHM
Although all clustering algorithms provide procedures which may be used to describe the structure of a set of data, they differ in many ways. Our interest here is in algorithms which are both efficient and effective for this particular application. There is no apparent need to generate a hierarchy of clusters, so hierarchical approaches were not considered. Consideration was instead given to algorithms with good time complexity; some of the iterative partitioning algorithms can be used to cluster quickly.

One such algorithm is the leader algorithm, as described in Hartigan [16]. The leader algorithm clusters the data in a single pass, in time O(kN) when k clusters result. A threshold for cluster membership is chosen. The first record is selected as the first leader. (Leaders are also called seed points or cluster centers.) Subsequent records are placed in a leader's cluster if their distance to that leader is below the threshold. A record which is too far from every existing leader is selected as a new leader. Records are placed into the first appropriate cluster.
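A minimal sketch of this single-pass leader algorithm follows; the distance function, the threshold value and the record representation are placeholders rather than details from Hartigan [16].

```python
def leader_clustering(records, threshold, dist):
    """Single-pass leader algorithm: the first record becomes the first leader,
    and each later record joins the first cluster whose leader is within the
    threshold, or else becomes a new leader itself."""
    leaders = [records[0]]
    clusters = [[records[0]]]
    for r in records[1:]:
        for leader, cluster in zip(leaders, clusters):
            if dist(r, leader) <= threshold:
                cluster.append(r)      # first appropriate cluster
                break
        else:
            leaders.append(r)          # too far from every leader: new cluster
            clusters.append([r])
    return leaders, clusters
```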
An improved leader algorithm, also described in Hartigan [16], can be used to achieve a clustering which is independent of the original record order. An average record is selected as the initial leader. For each pass, the object furthest from its own cluster leader is chosen as the leader of a new cluster; objects are then reassigned as necessary. If k clusters are to be found, k passes through the data are required. Clustering algorithms such as this one, which are based upon use of a distance from a cluster center, tend to find clusters of spherical shape. While this tendency can be a disadvantage in clustering data which contains clusters of other shapes, it is not necessarily a problem in this application.

The greatest distance algorithm is a heuristic modification of the improved leader algorithm which uses a different criterion for selecting new cluster leaders. In the improved leader algorithm, the new cluster leader is the record that is furthest from its cluster leader. In the greatest distance algorithm, it is the record that is furthest from any leader of a non-deviant cluster and whose distance from its own cluster leader is greater than the average record distance. This algorithm is given in Fig. 3. A non-deviant cluster is defined to be one that has more than one percent of an entire chunk's records in it, in addition to its leader. The average record distance is used as a criterion for being close to a leader; it is defined as the average distance between records and their leaders. If no record that meets the criteria for selection of a new cluster leader can be found, the algorithm terminates.

A schematic comparison of the improved leader and greatest distance algorithms is shown in Fig. 4. Two clusters are shown. Point A is chosen as the next cluster leader by the improved leader algorithm; it is the point furthest from its cluster leader.
Select an initial cluster leader.
In the first pass, select the record furthest from the initial cluster leader as another cluster leader.
On each subsequent pass:
    Assign each record to the cluster associated with the closest cluster leader.
    Select as new cluster leader that record which is furthest from any leader of a non-deviant cluster (one containing more than 1% of the records) and which is greater than the average record distance from its cluster leader.
Continue until k passes have been completed or a new cluster leader which meets the criteria cannot be found.

Fig. 3. Greatest distance algorithm.
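The sketch below follows the steps of Fig. 3. The reading of "furthest from any leader of a non-deviant cluster" as the largest distance to such a leader follows the Fig. 4 discussion below; the helper names and the handling of ties are our own choices.

```python
def greatest_distance_clustering(records, k, dist):
    """Sketch of the greatest distance algorithm of Fig. 3. Returns the cluster
    leaders and the final assignment of each record to a leader; singleton
    clusters are the records flagged for inspection as possible errors."""
    n = len(records)
    leaders = [records[0]]
    # First pass: the record furthest from the initial leader becomes a leader.
    leaders.append(max(records, key=lambda r: dist(r, leaders[0])))

    def assign():
        # Index of the closest leader for every record.
        return [min(range(len(leaders)), key=lambda i: dist(r, leaders[i]))
                for r in records]

    while len(leaders) < k:
        assignment = assign()
        sizes = [assignment.count(i) for i in range(len(leaders))]
        # Non-deviant: more than 1% of the chunk's records besides the leader.
        non_deviant = [i for i, s in enumerate(sizes) if s - 1 > 0.01 * n]
        # Average distance between records and their own cluster leaders.
        avg = sum(dist(r, leaders[a]) for r, a in zip(records, assignment)) / n
        best, best_d = None, -1.0
        for r, a in zip(records, assignment):
            if dist(r, leaders[a]) <= avg or not non_deviant:
                continue
            # Distance to the furthest non-deviant leader (cf. Fig. 4 discussion).
            d = max(dist(r, leaders[i]) for i in non_deviant)
            if d > best_d:
                best, best_d = r, d
        if best is None:               # no record meets the criteria: terminate
            break
        leaders.append(best)

    return leaders, assign()
```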
Fig. 4. A comparison of the improved leader and greatest distance algorithms.
Point B is chosen as the next cluster leader by the greatest distance algorithm. Even though it is closer to its cluster leader than Point A is, the distance between it and the furthest cluster leader is greater than for any other point.

The greatest distance algorithm is designed to find errors, rather than to form reasonable clusters. It is hoped that by picking out a new leader that is furthest from any other leader of a non-deviant cluster and that is relatively distant from its own leader, the strangest records will be selected as cluster leaders. Prospective leaders are compared only to non-deviant clusters because it is desired to find records that are distant from "normal" records, not from very strange records. At most k passes through the data are required. During each pass, each record must be compared to the set of at most k current cluster leaders. The selection of a new leader requires some additional computation using the distances already determined during the pass. The cluster centers are not updated. Thus the time required is O(k²N).
EXPERIMENTAL DATA
The experiments used 1704 records from files maintained by the Florida Department of Transportation. These records describe the intersections on various roads throughout Florida. Each record contains 10 fields that describe the characteristics of an intersection. Three of these fields were selected for use in the cluster analysis. The first field is called the milepost. It is a five digit number used to describe how far an intersection is from the beginning of a road. The second field is called the node number. This is a five digit number that is used as the unique label for an intersection. Node numbers are usually assigned in roughly ascending order starting at the beginning of the road. The third field is called the clr (composite left-right). This field indicates on which side of a divided road an intersection is located. A clr of 1 indicates a non-divided road. A clr of 2 indicates the left-hand side of a divided road, and a clr of 3 indicates the right-hand side of a divided road. The milepost and node fields were treated as numeric fields, and the clr was treated as a coded field. An example of the data used in the study is shown in Fig. 5.
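As a small illustration, such records can be presented to the distance measure sketched earlier as tuples of the three selected fields; the values below are invented, and only the field layout (milepost and node number numeric, clr coded) follows the description above.

```python
# Each record: (milepost, node number, clr); milepost and node number are numeric,
# clr is a coded field. The values here are invented, for illustration only.
records = [
    (10986, 829, 1),
    (11236, 830, 1),
    (11492, 831, 1),
]
numeric_fields = {0, 1}   # indices treated as numeric; index 2 (clr) is coded

# Reusing the weighting_factors and record_distance sketches given earlier.
weights = weighting_factors(records, numeric_fields)
d_01 = record_distance(records[0], records[1], weights, numeric_fields)
```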
Fig. 5. Example data (milepost, node number and clr columns). A-Correct record flagged; B-record with error flagged; *-this error was not located by the improved leader algorithm.
EXPERIMENTAL PROCEDURES
Three different clustering algorithms were used in the experiments. All of the programs were implemented in Pascal 6000 at Florida State University and were executed on a Control Data Cyber 74 computer system. The first algorithm is the short spanning path algorithm used by Lee et al. [11]. The second algorithm is the improved leader algorithm described in Hartigan [16]. The third algorithm is the greatest distance algorithm previously described.

There are a large number of possible distance measures that have been used or suggested for clustering analysis. All of the reference books on clustering previously listed contain a discussion of distance measures; a recent overview is given in Gower [30]. The distance measure used in the experiments is the same one used by Lee et al. [11]; this measure is used to allow the results to be compared more easily.

The file was subdivided into chunks, and clustering was performed separately on each of these chunks. Since the intersections are normally kept sorted first by the road that they are on and then by their location on the road, no preprocessing had to be done. The data was already sorted, and the intersections on a common road could be automatically grouped together before analysis.

The computation of the weighting factors used in the distance function requires O(N²) time; the computation of N(N-1)/2 distances for each field on the input record is required. The greatest distance algorithm was also run with estimated weighting factors in order to observe the effect on performance. These estimated weighting factors were calculated by using increments of five to determine the records sampled. Of course, such estimates could have been used with either of the other two algorithms as well.
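A sketch of this estimation is given below; the reading of "increments of five" as a systematic sample of every fifth record is our interpretation, and weighting_factors refers to the earlier sketch of the exact computation.

```python
def estimated_weighting_factors(records, numeric_fields, step=5):
    """Estimate the w_k from every fifth record rather than from all
    N(N-1)/2 record pairs, reducing the cost of the O(N^2) weight step."""
    sample = records[::step]              # every fifth record
    return weighting_factors(sample, numeric_fields)
```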
EXPERIMENTAL RESULTS
The experimental results are shown in Fig. 6. It should be noted that the weighting factors are computed for each algorithm. Thus the times given all include the time required to compute the weighting factors, which is independent of the time required to perform the actual clustering. Some idea of the impact of this can be seen in the difference between the times required for the greatest distance algorithm with and without the estimated weighting factors. When the weighting factors were computed exactly rather than estimated, more time was spent determining the weighting factors than was spent doing the actual clustering.

Both the short spanning path and greatest distance algorithms found five errors in the data examined. Three of these errors were intersections that had been placed on the wrong road by a malfunctioning file maintenance program. The other two errors were the result of data entry errors which caused a single intersection to be labeled with conflicting node numbers. These errors caused intersections to have node numbers and clr values that were not consistent with those of other nearby intersections on the same road. This inconsistency caused these intersections to form singleton clusters and be detected as errors. It should be mentioned that none of the errors detected by the program could have been detected by a simple range check; these errors could only be detected by considering a record's relationship to other records. Records that were in error here might not be errors if they represented intersections on a different road.

Four of these five errors were also found by the improved leader algorithm. One rather obvious error was missed by this algorithm; it is shown in Fig. 5. Although this record was not very close to its cluster leader, it was still closer to its leader than other correct records were to theirs. This caused the error not to be selected as a cluster leader when only five clusters were produced.
Algorithm                                          Errors found   False alarms   Time used (s)
Short spanning path                                      5              29              97
Improved leader                                          4              63              56
Greatest distance                                        5              30              56
Greatest distance, estimated weighting factors           5              30              25

Fig. 6. Experimental results.
Most of the records flagged as possible errors by the three algorithms were actually false alarms. Many of these false alarms are caused by deviations in node number assignment when an intersection is at the beginning or end of the road, or when it is the intersection of two major highways. Examples of some inconsistencies that are errors and of some that are not errors can be seen in Fig. 5. The number of false alarms found by the greatest distance algorithm (30) is comparable to the number found by the short spanning path algorithm (29). Over twice as many records were flagged by the improved leader algorithm (63). This is due in part to the fact that the algorithm is forced to find five clusters in every subset of data.

The use of the greatest distance algorithm provided error detection performance comparable to that of the short spanning path algorithm, with improved time requirements comparable to those of the improved leader algorithm. Furthermore, the use of estimated weighting factors provided considerable improvements in time with no impact on performance. The ratio of errors found to false alarms in these experiments (1/6) is lower than that observed in the warship data (1/1). This might be due to the size of the file or possibly to the type of data. At any rate, the percentage of records flagged for further investigation in the road data with the greatest distance algorithm (35 out of 1704) is smaller than the percentage flagged in the warship data (4 out of 50).

CONCLUSIONS

Clustering programs similar to those described here can be used as part of a more comprehensive data editing system. Although they could certainly not replace more conventional approaches, such as data entry checks, they could provide a valuable supplement to such approaches. The outliers found by clustering often represent data errors. However, they are sometimes simply unusual and correct data points. So it is necessary to screen the records flagged as outliers; this might need to be done manually. However, the number of such records would be only a small fraction of the entire file.

The files used in the experiments described here are relatively small. Consider a much larger file of 1,000,000 records. It would take approx. 4 h to screen a file of this size using the current implementation of the greatest distance algorithm. (This algorithm is linear in N, so the times can simply be extrapolated.) A run of this length would need to be handled as a batch program if the entire file is to be processed at once. However, it would also be feasible to process the file in smaller chunks. If the same fraction of records were flagged as in the smaller file, one would expect approx. 3,000 errors (0.3%) and 18,000 false alarms (1.8%).
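The extrapolation can be checked in a few lines. The 25 s time and the counts of 5 errors and 30 false alarms are the Fig. 6 figures for the greatest distance algorithm with estimated weighting factors; linear scaling in N is assumed, as stated above.

```python
small_n, small_seconds = 1704, 25        # Fig. 6: greatest distance, estimated weights
large_n = 1_000_000

seconds = small_seconds * large_n / small_n      # about 14,700 s, roughly 4 h
errors = 5 * large_n / small_n                   # about 2,900 expected errors (0.3%)
false_alarms = 30 * large_n / small_n            # about 17,600 false alarms (1.8%)
```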
While these records represent a small fraction of the file, it would be a tedious and time-consuming task to screen them manually to determine which flagged records represent actual errors. It would clearly be desirable to develop algorithms that would generate fewer false alarms.

The experimental results presented here provide support for the conclusion that clustering analysis can be used as one approach to data editing. Furthermore, the greatest distance algorithm achieved a reasonable combination of efficiency and effectiveness in flagging potential errors. Of course, it is possible that other clustering algorithms would do as well or better.

REFERENCES
[1] A. F. Cardenas. Data Base Management Systems. Allyn and Bacon, Boston, Mass. (1979).
[2] E. S. Loomis. Data Management and File Processing. Prentice-Hall, Englewood Cliffs, N.J. (1983).
[3] B. J. Salzberg. An Introduction to Data Base Design. Academic Press, Orlando, Fla. (1986).
[4] G. Wiederhold. Database Design. McGraw-Hill, New York (1983).
[5] I. P. Felligi and D. Holt. A systematic approach to automatic editing and imputation. J. Am. Stat. Assoc. 71, 17-35 (1976).
[6] R. J. Freund and H. O. Hartley. A procedure for automatic data editing. J. Am. Stat. Assoc. 62, 341-352 (1967).
[7] J. Naus. Data Quality Control and Editing. Marcel Dekker, New York (1975).
[8] J. I. Naus, T. G. Johnson and R. Montalvo. A probabilistic model for identifying errors and data editing. J. Am. Stat. Assoc. 67, 943-950 (1972).
[9] T. Wright (Ed.) Statistical Methods and the Improvement of Data Quality. Academic Press, Orlando, Fla. (1983).
[10] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York (1973).
[11] R. C. T. Lee, J. R. Slagle and C. T. Mong. Towards automatic auditing of records. IEEE Trans. Softw. Engng SE-4, 441-448 (1978).
[12] W. F. Storer. Clustering analysis in the automatic auditing of records. M.S. Thesis, Department of Mathematics and Computer Science, Florida State University, Tallahassee, Fla., Mar. (1981).
[13] W. F. Storer and C. M. Eastman. Experiments on the application of clustering techniques to data validation. Proc. 4th Int. Conf. Information Storage and Retrieval, pp. 88-89, Oakland, Calif., May-Jun. (1981).
[14] B. S. Duran and P. L. Odell. Cluster Analysis: A Survey. Springer, Berlin (1974).
[15] B. Everitt. Cluster Analysis, 2nd edn. Halsted Press, New York (1980).
[16] J. A. Hartigan. Clustering Algorithms. McGraw-Hill, New York (1975).
[17] H. C. Romesburg. Cluster Analysis for Researchers. Lifetime Learning Publ., Belmont, Calif. (1984).
[18] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. Freeman, San Francisco, Calif. (1973).
[19] V. Barnett and T. Lewis. Outliers in Statistical Data, 2nd edn. Wiley, New York (1984).
[20] A. V. Hall. The peculiarity index, a new function for use in numerical taxonomy. Nature 206, 952 (1965).
[21] D. W. Goodall. Deviant index: a new tool for numerical taxonomy. Nature 210, 216 (1966).
[22] P. MacNaughton-Smith, W. T. Williams, M. B. Dale and L. G. Mockett. Dissimilarity analysis: a new technique of hierarchical subdivision. Nature 202, 1034-1035 (1964).
[23] J. MacQueen. Some methods for the classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. Mathematical Statistics and Probability 1, 281-297 (1967).
[24] G. H. Ball and D. J. Hall. A clustering technique for summarizing multivariate data. Behav. Sci. 12, 153-155 (1967).
[25] D. Wishart. An algorithm for hierarchical classifications. Biometrics 25, 165-170 (1969).
[26] J. R. Slagle, C. L. Chang and S. R. Heller. A clustering and data-reorganizing algorithm. IEEE Trans. Systems Man Cybern. SMC-5, 125-128 (1975).
[27] M. V. Bhat and A. Haupt. An efficient clustering algorithm. IEEE Trans. Systems Man Cybern. 6, 61-64 (1976).
[28] W. T. McCormick, P. J. Schweitzer and T. W. White. Problem decomposition and data reorganization by a clustering technique. Ops Res. 20, 993-1009 (1972).
[29] J. R. Slagle, C. L. Chang and R. C. T. Lee. Experiments with some cluster analysis algorithms. Pattern Recog. 6, 181-187 (1974).
[30] J. C. Gower. Metric and Euclidean properties of dissimilarity coefficients. J. Classif. 3, 5-48 (1986).
ematics and Computer Science, Florida State University, Tallahassee, na., Mar. (1981). 1131 W. F. Storer and C. M. Eastman. Exoeriments on the application of clustering techniques to data validation. P&c. 4th Int. Conf: Inf&mation- Storoge and Retrieval, DD. 88-89. Oakland. Calif.. Mav-Jun. (19811. 1141K S. Duran and P. i. O’Dell:Ciiter Ana&sis:‘A Survey. Springer, Berlin (1974). 1lSl B. Eve&t. Cfuster Analysis., 2nd edn. Halsted Press, New York (1980). Ml J. A. Hartigan. Clustering Algorithms. McGraw-Hill, New York (1975). 1171H. C. Romesburg. Cluster Analysis for Researchers. Lifetime Learning Publ., Belmont, Calif. (1984). WI P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles ond Practices of Numerical Cio&cation. Freeman. San Francisco. Calif. (1973). w4 V. Bamett and ?‘. Lewis. Out&s in St>tistiLol Data, 2nd edn. Wiley, New York (1984). WI A. V. Hall. The peculiarity index, a new function for use in numerical taxonomy. Nature 206, 952 (1965). [2ll D. W. Goodall. Deviant index: a new tool for numerical taxonomy. Nature 210, 216 (1966). [22] P. MacNaunhton-Smith. W. T. Williams, M. B. Dale and L. G. Mockett. Dissimilarity analysis: a new technique of hierarchical subdivision. Norure 202, 1034-1035 (1964). -1231 . J. MacQueen. Some methods for the classification and analysis of multivariate observations. Proc. 5rh Berkeley Symp. Mathematical Statistics and Probability 1, 281-297 (1967). i241 G. H. Ball and D. J. Hall. A clustering technique for summarizing multivariate data. Behac. Sci. 12, 153-155 (1967). P51D. Wishart. An algorithm for hierarchical classifications. Biometrics 20, 165-170 (1969). PI J. R. Slagle, C. L. Chang and S. d. Heller. A clustering and data-reorganizing algorithm. IEEE Trons. Systems Man Cyber. SMC-5, 125-128 (1975). (271 M. V. Bhat and A. Haupt. An efficient clustering algorithm. IEEE Trans. Systems Mon Cyber. 6, 61-64 (1976). [28] W. T. McCormick, P. J. Schweitzer and T. W. White. Problem decomposition and data reorganization by a clustering technique. Ops Res. 20, 993-1009 (1972). [29] J. R. Slagle, C. L. Chang and R. C. T. Lee. Experiments with some cluster analysis algorithms. Pattern Recog. 6, 181-187 (1974). [30] J. C. Gower. Metric and Euclidean properties of dissimilarity coefficients. J. Classif. 3, 5-48 (1986).