Inform. Systems Vol. 12, No. 3, pp. 239-242, 1987
Printed in Great Britain
0306-4379/87 $3.00 + 0.00
Pergamon Journals Ltd
DUPLICATE RECORD IDENTIFICATION IN BIBLIOGRAPHIC DATABASES

PANKAJ GOYAL

Department of Computer Science, Concordia University, Montreal, Canada H3G 1M8

(Received 22 August 1986; in revised form 20 February 1987)
Abstract - This study examines the applicability of an automatically generated code for duplicate detection in bibliographic databases. It is shown that the methods generate a large percentage of unique codes, and that the code is short enough to be practical. The code would prove particularly useful in identifying duplicates when records are added to the database.

1. INTRODUCTION
Whenever a record is to be added to a database, it is necessary to determine whether the record already exists in the base; this is imperative for maintaining the integrity of the base. Duplicate records in a database not only waste storage space but also degrade retrieval and indexing performance. In the past, duplicate records were detected manually: the librarians filing new cards made the decisions based on local cataloguing rules. But as databases grow in size, or when a number of databases must be merged into a single database, the problem becomes manageable only by computer. The situation is made worse when the database is large and there are a large number of users. For this reason, the matching of bibliographic databases has been a concern of a number of library networks. The decision process for duplicate detection would still be a local one, and there would always be some "errors" because of differing definitions of a duplicate. The common denominator of all the different duplicate detection schemes is the task of unique record identification. In the absence of an embedded (preassigned or generated) unique key, identifying information is extracted and/or generated from existing record descriptions. As the goal of automatic duplicate detection is to reduce, if not eliminate, human intervention, this identifying information, or key, is designed for maximum uniqueness.

2. PREVIOUS WORK
At the Oak Ridge National Laboratory [1], efforts have been made to find duplicates in bibliographic journal citation files. The algorithm generated fixed-length keys from the date, page, the Soundex-coded first author's name, the CODEN code, an 8 character code of the journal, the volume, and an 8 character (all consonants) title code. Duplicate detection was carried out by sorting on different key elements and the use of weighted matching. The scheme is claimed to have performed "well".

Williams and MacLaury [2, 3], at the University of Illinois, also considered duplicate detection in their attempt to merge MARC files. The duplicate detection process was divided into two passes. In the first pass, a key generated from the title and date elements was used, with letters taken from "high entropy" positions in the title words. Duplicates detected in this pass were subjected to author, full title (Harrison-Hamming) and pagination matching tests. Their study also revealed several variations in cataloguing.

Hickey and Rypka [4], at OCLC, have tested a duplicate detection algorithm using a 52 byte key per record, divided into exact and partial match sections. The duplicate detection algorithm was controlled by a decision table, and the exact match section was used to group related keys so as to minimize full key comparisons. By random checking, it was estimated that 54 or 69% of the actual duplicates were detected, depending on whether reprints with different imprint dates were considered valid duplicate records. A number of errors were identified, and a good deal of "cleaning" of fields was done before key generation.

3. UNIVERSAL STANDARD BOOK CODE (USBC)
Yannakoudakis et al. [5] present the history and the salient features of the USBC. The USBC (Fig. 1), as used in this study, is a 17 character code generated from a group of MARC record fields. Individual field codes, except for the title, were generated by simple selection and truncation methods and, in the case of the language, by table look-up. The title element was coded using three different methods. Yannakoudakis [6] presented a coding scheme based on maximum self-information (i.e. least frequent symbols), and Goyal [7, 8] advocated a method based upon the maximum entropy principle [9]. Goyal [10] compared the efficiencies of different coding methods and showed that the maximum entropy coding scheme gives the best performance of all. For this study three different coding methods are used on the title field: in addition to the maximum self-information and maximum entropy methods, the code for the title element was also generated using a starting frequency of 2 (see Appendix).
Element   Title T1   Title T2   Publisher   Language   Edition   Volume   Date   Weight
Type         A          AN          A           N          N        AN       N       N
Size         5           2          2           1          1         2       3       1

Fig. 1. Structure of the USBC code. T1 and T2 are title elements; T2 may include digits. A: alphabetic, N: numeric, AN: alpha-numeric.
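The layout of Fig. 1 can be illustrated with a short sketch. The following Python fragment assembles the 17 character code from precomputed element codes; the field names, the blank-padding convention and the sample values are assumptions for illustration, not part of the USBC specification.

```python
# Minimal sketch of assembling the 17 character USBC from its element
# codes, following the element order and sizes of Fig. 1. Field names,
# the padding convention and the sample values are assumptions.

USBC_LAYOUT = [        # (element, size); sizes sum to 17
    ("title_t1", 5), ("title_t2", 2), ("publisher", 2),
    ("language", 1), ("edition", 1), ("volume", 2),
    ("date", 3), ("weight", 1),
]

def assemble_usbc(elements):
    """Concatenate the element codes, truncating or blank-padding
    each one to its fixed size from Fig. 1."""
    return "".join(
        elements.get(name, "")[:size].ljust(size)
        for name, size in USBC_LAYOUT
    )

# Hypothetical element codes for one MARC record:
code = assemble_usbc({
    "title_t1": "TQZXJ", "title_t2": "4S", "publisher": "JW",
    "language": "2", "edition": "1", "volume": "01",
    "date": "984", "weight": "7",
})
print(code, len(code))   # TQZXJ4SJW21019847 17
```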
In a manual analysis of some of the records that generate the same code on the title entries [7], it was found that these were mostly caused by corporate entries, or the records included some distinguishing digits in their title fields. Goyal [7] showed that including the corporate entry information, contained in tag 110 of the MARC record, as part of the title improved coding performance by about 2%. More importantly, it reduced the "degree of collisions" (i.e. the maximum number of records that generate the same code) in the tests from 29 to 6.

The publisher element was coded from tag 260 subfield "B" by selecting the initial letters of the first two words or, where there is a single word entry, the first two letters of the word. In the database there are a number of inconsistencies in the use of abbreviations, and these affect the coding schemes. For example, while the coding scheme, rightly, will not distinguish between "J. Wiley", "J. Wiley and Sons", "John Wiley", etc., it will distinguish them from "Wiley", "Wiley and Sons", etc. The language was coded from the fixed-length tag 008 using the table given in Ayres [11]. The edition was coded using tag 250 "A"; the last digit found was used as the edition, and if no digits were found the edition was taken to be 1. From a random manual analysis of records that generated the same code, it was very doubtful that there would be any gains from a more complicated algorithm; if it is critical, the program can check every word in the field to see whether it represents a numeral. The last three digits of tag 260 "C" were picked for the date field. The volume/part element was generated from tag 245 "G", with words like "volume", "part" or any of their abbreviations, and the suffixes "st", "nd", "rd" and "th", being neglected. The weight is the length of the title field modulo 10.
4. RESULTS

The scheme was tested with the title element coded using the maximum self-information method, a starting frequency of 2, and the maximum entropy method, hereafter referred to as the USBC-I, USBC-2 and USBC-E methods, respectively. The tests were performed on MARC files obtained from the British National Bibliography for the years 1971 and 1975, referred to as BNB71 and BNB75, and an OCLC 1977 file, referred to as OCLC77. BNB71 had 30651 records, BNB75 31369 and OCLC77 44397 records.

The results in Table 1 are compared with the results reported in [5] and show the benefit of including the data in the corporate entry (tag 110) and digits in the title element. The results also show the discriminating power of the code; the title coding element "T1" (Fig. 1) contributes substantially to the discriminating power. The coding schemes USBC-I, USBC-2 and USBC-E differ only in their coding of the title element T1. Goyal [10] presents a performance analysis of the different title coding schemes; for example, on the BNB71 database, the different coding schemes for title element T1 (code length 5) had performance differences of up to 17% (Table 2). From the BNB71 and BNB75 bases all records that generated non-unique codes were extracted and manually examined. Table 3 gives data about the actual duplicates and the coding failures.
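The extraction of the non-unique codes, together with the measures used in the tables that follow, can be sketched as below; the (record_id, code) representation is hypothetical.

```python
from collections import defaultdict

# Sketch of the analysis step: group records by generated code, pull
# out the codes produced by more than one record for manual review,
# and compute the u/d/c measures of Table 1.

def analyse(records):
    """records: list of (record_id, code) pairs."""
    groups = defaultdict(list)
    for record_id, code in records:
        groups[code].append(record_id)
    collisions = {c: ids for c, ids in groups.items() if len(ids) > 1}
    n = len(records)
    u = 100.0 * sum(len(ids) == 1 for ids in groups.values()) / n
    d = 100.0 * len(groups) / n                    # % of distinct codes
    c = max(len(ids) for ids in groups.values())   # degree of collisions
    return collisions, u, d, c

sample = [(1, "AAAA"), (2, "BBBB"), (3, "AAAA"), (4, "CCCC"), (5, "AAAA")]
print(analyse(sample))
# ({'AAAA': [1, 3, 5]}, 40.0, 60.0, 3)  -- one "triplicate" group
```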
Table 1. Comparative performance of USBCs with different title coding elements

Database        USBC-I   USBC-2   USBC-E   Yannakoudakis et al. [5]
BNB71     u      99.86    99.86    99.91    99.59
          d      99.93    99.93    99.95    99.78
          c      4        4        2        -
BNB75     u      99.74    99.74    99.78    99.64
          d      99.87    99.87    99.89    99.82
          c      3        3        2        -
OCLC77    u      98.87    98.90    98.93    -
          d      99.43    99.44    99.45    -
          c      3        3        3        -

u: % of unique codes; d: % of distinct codes (some codes may be generated by more than one record); c: degree of collisions.
Table 2. Performance analysis for different coding schemes for the title element T1 (code length 5)

BNB71 (6260 randomly selected records)

       USBC-I   USBC-2   USBC-E
u       70.91    84.33    87.75
d       81.50    90.75    92.84
Table 3. Analysis of records flagged as duplicates in the BNB71 and BNB75 databases

                                  BNB71                        BNB75
                       USBC-I and USBC-2   USBC-E   USBC-I and USBC-2   USBC-E
                           I      II       I   II       I      II       I   II
Duplicates                19      12      14    7      38      18      34   14
Triplicates                0       0       0    0       1       1       0    0
Quadruplicates             1       1       0    0       0       0       0    0
Number of records
involved                  42              28           79              68
Number of records
deletable from the base          7                            20

I: pairs flagged by the codes; II: coding failures.
The failures were mainly due to distinguishing information not utilized by the USBC and, in some cases, to errors in the records.

An analysis of the OCLC77 base using the different USBC methods gave the results of Table 4. 27 records (from 27 duplicate pairs) were microfilm copies of the originals and would be classified as duplicates by some and not by others. In a study of 33 of the records (15 duplicates and 1 triplicate) flagged differently by the USBC-I and USBC-2 methods, four pairs of records were found to differ only in the spelling of a single word: (i) DU/DE; (ii) DEL/DAL; (iii) BECAME/BACAME; (iv) LENINSMES/LENINSMUS. One of the records in the triplicate group did not contain the corporate entry field. The results from this and a similar analysis of 34 other records are given in Table 5.

By studying the duplicates flagged by a coding of the title element only, but not flagged by the full USBC codes, errors and variations in the data recorded on the MARC tapes were identified. These errors were similar in nature to those identified in the Illinois study [3]. The variations are in languages, recording medium, publisher, editions and dates; the errors were in the language field, publisher differences (with place name), missing edition information, partial publisher names, and the inclusion of different digits in the title field.
Table 4. Number of record pairs flagged as duplicates in the OCLC77 base

               USBC-I   USBC-2   USBC-E
Duplicates      236      231      224
Triplicates       9        8        8
Table 5. Comparison of a sample of identified pairs from the OCLC77 base

Method                                   Pairs studied   Coding failures   Actual duplicates
USBC-I only              Duplicates           10                6                  4
                         Triplicates           1                0                  1
USBC-2 only              Duplicates            5                4                  1
                         Triplicates           0                0                  0
Both USBC-I and USBC-2   Duplicates            -                -                  -
                         Triplicates           -                -                  -
Only a few spelling errors were identified because the coding method used for the titles is completely immune to transpositional errors (i.e. the correct set of characters but in the wrong positions), and partially immune to extra and missing characters when these occur with a frequency not used by the coding scheme, typically the most frequent characters for all three schemes.

Hickey and Rypka [4], while reporting a duplicate detection rate of 3%, estimate the actual duplication to be between 7 and 8% in the OCLC 1976 base. The USBC identified only about 1% duplication on the OCLC 1977 base. To verify the results obtained, a test was carried out using just the title field of the records: a 20 character code was generated from the titles, which gave a duplication of about 3%. The first 25 pairs of records from this study not flagged as duplicates by the USBC methods were analyzed. The analysis showed variations in the manner of recording data and also input errors (see above).
5. DISCUSSION
The analysis has shown that duplicate detection would require record standardization and the correction of errors, both very expensive processes if applied to all existing databases. This must be weighed against the cost of storing duplicates and imperfect records. For all new records added to the database, stringent validation is a necessity; the validation task can be undertaken at an intelligent terminal (workstation). The characteristic required of the USBC is to uniquely identify differing records and also to detect duplicate records: records which, while possibly different, perhaps because of errors, are to be classed as the same. The very high discriminating power of the USBC can be used for automatic duplicate detection on error-free bases. It has been shown in this study that the number of records in need of human intervention would be less than 0.2%.

The false drops could be reduced by an expansion of the USBC, by encoding more elements and/or using longer codes for the elements. However, these codes would have to be maintained in the system, and for very large bibliographic databases this may be quite expensive. The performance can also be improved by using a confidence measure on the code elements: two records whose codes differ only in elements known to be more prone to errors should be flagged for further examination. The title element provides some 95% of the discriminating power of the USBC code [7]. The discriminating power of the other elements, as well as the effect of controlled errors on the discriminating power of the title element codes, needs to be investigated.
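A minimal sketch of this confidence-based comparison, assuming the element offsets of Fig. 1 and a purely hypothetical choice of error-prone elements, might look as follows.

```python
# Sketch of the confidence measure suggested above: compare two USBCs
# element by element and flag the pair for manual review when they
# differ only in elements known to be error-prone. The per-element
# slices follow Fig. 1; the ERROR_PRONE set is a hypothetical choice.

USBC_SLICES = {
    "title_t1": (0, 5), "title_t2": (5, 7), "publisher": (7, 9),
    "language": (9, 10), "edition": (10, 11), "volume": (11, 13),
    "date": (13, 16), "weight": (16, 17),
}
ERROR_PRONE = {"publisher", "edition", "date"}   # assumption

def compare(code_a, code_b):
    """'duplicate' for identical codes, 'review' when the codes differ
    only in error-prone elements, 'distinct' otherwise."""
    differing = {name for name, (i, j) in USBC_SLICES.items()
                 if code_a[i:j] != code_b[i:j]}
    if not differing:
        return "duplicate"
    if differing <= ERROR_PRONE:
        return "review"
    return "distinct"

# Two codes differing only in the publisher element:
print(compare("TQZXJ4SJW21019847", "TQZXJ4SWI21019847"))  # 'review'
```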
Acknowledgements - The help of Mr F. H. Ayres, Dr J. A. W. Huggill and Dr E. J. Yannakoudakis is gratefully acknowledged.
REFERENCES

[1] C. A. Giles, A. A. Brooks, T. Doszkocs and D. J. Hummel. A computerized scheme for duplicate checking of bibliographic databases. ORNL-CSD-5 (1976).
[2] M. E. Williams and K. D. MacLaury. A state-wide union catalog feasibility study. Final Rept, Univ. of Illinois (1976).
[3] M. E. Williams and K. D. MacLaury. Automatic merging of monographic databases-identification of duplicate records in multiple files: the IUCS scheme. J. Libr. Automn 12(2), 156-168 (1979).
[4] T. B. Hickey and D. J. Rypka. Automatic detection of duplicate monographic records. J. Libr. Automn 12(2), 126-142 (1979).
[5] E. J. Yannakoudakis, F. H. Ayres and J. A. W. Huggill. Character coding for bibliographic record control. Comput. J. 23(1), 53-60 (1980).
[6] E. J. Yannakoudakis. Towards a universal record identification and retrieval scheme. J. Informatics 3, 7-11 (1979).
[7] P. Goyal. Computer coding processes to aid bibliographic record control and storage. Ph.D. thesis, Univ. of Bradford (1981).
[8] P. Goyal. The maximum entropy approach to record abbreviation for optimal record control. Inform. Process. Mgmt 19(2), 83-85 (1983).
[9] E. T. Jaynes. Information theory and statistical mechanics, Parts I and II. Phys. Rev. 106, 620-630; 108, 171-190 (1957).
[10] P. Goyal. An investigation of different string coding methods. J. Am. Soc. Inform. Sci. 35(4), 248-252 (1984).
[11] F. H. Ayres. The Universal Standard Book Number (USBN): a new method for the construction of control numbers for bibliographic records. Program 8(3), 166-173 (1974).
APPENDIX
The T1 element of the code (Fig. 1) is generated by selecting letters based on their frequency of occurrence in the title. The distinct letters constituting the title can be visualized as arranged in a cyclic buffer (see Fig. 2), the elements of the buffer being equifrequent groups of letters.

[Fig. 2. Frequency groups obtained after analyzing the title of this article. The numerals beside the boxes indicate the frequency of each group.]

Figure 2 shows the equifrequent groups obtained after analyzing the title of this article. Coding starts by selecting letters from within a particular group (starting frequency, SF; for USBC-I: SF = 1, for USBC-2: SF = 2, and for USBC-E, SF is obtained using the method in [8]) and continues until either the required code length is reached or the letters in the group are exhausted. In the latter case, coding continues with the selection of letters from the "next" group. If all groups are exhausted, the generated code is padded with blanks.
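The procedure can be sketched in Python as follows; the ordering of letters within a group and the direction of travel around the cyclic buffer are not fully specified above and are assumptions here (ascending frequency, wrapping around past the highest frequency).

```python
from collections import Counter

# Sketch of the T1 coding procedure: build equifrequent groups of the
# distinct letters of the title, start selecting at the group with
# frequency SF, and move through the groups cyclically until the code
# is full, padding with blanks if every group is exhausted. The order
# of letters within a group (alphabetical here) is an assumption.

def code_title_t1(title, length=5, start_freq=1):
    letters = [ch for ch in title.upper() if ch.isalpha()]
    freq = Counter(letters)
    frequencies = sorted(set(freq.values()))
    # Cyclic order: groups with frequency >= SF first, then wrap around.
    order = ([f for f in frequencies if f >= start_freq] +
             [f for f in frequencies if f < start_freq])
    code = []
    for f in order:
        for letter in sorted(ch for ch, k in freq.items() if k == f):
            if len(code) == length:
                return "".join(code)
            code.append(letter)
    return "".join(code).ljust(length)   # pad with blanks

# SF = 1 (USBC-I) starts from the least frequent letters:
title = "Duplicate record identification in bibliographic databases"
print(repr(code_title_t1(title, start_freq=1)))
print(repr(code_title_t1(title, start_freq=2)))
```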