Quality of indexing in online databases: An alternative measure for a term discriminating index


0306-4573/88 $3.00 + .00 Copyright © 1988 Pergamon Press plc


Isola Ajiferuke and Clara M. Chu School of Library and Information Science University of Western Ontario London, Ontario, Canada

Abstract. An alternative measure to White & Griffith's term discriminating index is proposed. The measure takes the collection size of a database into consideration and, unlike White & Griffith's measure, its value can never exceed 1.

The quality of indexing in online databases is often measured by retrieval effectiveness tests, involving users' requests and relevance judgements [1]. However, White and Griffith recently developed an alternative method which aims at showing the quality of controlled vocabulary, dictated by the indexing system adopted by the database producer and the policy guidelines used to apply the system [2]. The method involved identifying clusters of documents that are similar in content, searching for each document from a given cluster in a database, identifying the terms used by the databases to index each document, and calculating certain measures to determine indexing quality. A term is assumed to serve a system well if it spans at least half of the documents in a cluster (i.e., a term which is used to index half or more of the documents in a cluster), and has a discriminating index of at least 0.25. The discriminating index or value of a term measures the degree to which the use of the term will help to distinguish documents from each other [3]. White and Griffith did not use the term discrimination value developed by Salton et al. because it appears computable only for a small "test bed" of documents in the laboratory [2], and instead developed their own measure. Their discriminating index, a decimal fraction between 0 and 1, is defined as:

$$\text{Discrimination Index}_{\text{Term A}} = \frac{1}{\log_{10}\left(\text{Postings}_{\text{Term A}}\right)}$$

However, this index has two major problems:

1) The measure can exceed 1, contrary to the authors' assertion. Any term with a number of postings less than 10 (i.e., a term describing less than 10 documents in a database) would have a discrimination index greater than 1. In a study to replicate the authors' methodology, we compared the quality of indexing of Library Literature (LL), Library and Information Science Abstracts (LISA), and Information Science Abstracts (INFO) databases using clusters of library and information science literature [4]. It was discovered that in LL, "Macroeconomics" had 5 postings (i.e., a discrimination value of $1/\log_{10} 5 = 1.431$).

Brief Communication


2) The measure does not take the collection size of the databases into consideration. Terms from different databases could have the same value, but if the databases varied in size, the value could no longer form a good basis for comparison. For example, "Classification Schemes" in LISA had 391 postings and "Library Instruction" in INFO had 391 postings; both have the same index value of 0.386 but their databases differ in size: 83,450 and 119,400 records, respectively. Also, terms from a smaller database would in most cases have higher discrimination values than terms from a larger database. For example, no terms with a number of postings greater than 10,000 were found for LL (31,000 records), which is the number of postings required to fall below the 0.25 discriminating threshold value. A term with such a low index value, if found in the databases at the time of the study (August 1987), would be indexing greater than 32% of all records in LL, greater than 12% of all records in LISA, and greater than 8% of INFO's records, which suggests that this index is more appropriate for evaluating databases of comparable size. The highest number of postings for an LL term in this study was 210, which is nowhere near the 10,000 posting mark. However, the other, larger databases do show high numbers of postings which represent poor discriminators: 3 were found in LISA ("Services", "Research", and "Librarianship" had 28,289, 11,884, and 11,808 postings, respectively) and 2 were found in INFO ("Research" and "Software" had 18,306 and 10,775 postings, respectively).

As a result of the stated problems with White & Griffith's term discrimination index, we hereby propose an alternative index which takes into account the size of the database to arrive at an appropriate discriminating index value for comparing databases of varying size. This index is defined as:

$$\text{Discrimination Index}_{\text{Term A}} = \frac{\text{Postings}_{\text{Term A}}}{\text{Size of database}}$$
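As a rough illustration, the two indexes can be compared numerically using the figures quoted above. The sketch below is in Python; the function names `white_griffith_index` and `proposed_index` are hypothetical labels chosen for this example and do not come from either study, and the input checks are assumptions added for safety.

```python
import math


def white_griffith_index(postings: int) -> float:
    """White & Griffith's index: 1 / log10(postings).

    Assumption: postings must exceed 1, since log10(1) = 0 would
    cause a division by zero and smaller values give no usable result.
    """
    if postings <= 1:
        raise ValueError("postings must be greater than 1")
    return 1 / math.log10(postings)


def proposed_index(postings: int, database_size: int) -> float:
    """Proposed index: postings divided by the size of the database."""
    if database_size <= 0 or not 0 <= postings <= database_size:
        raise ValueError("need 0 <= postings <= database_size, with size > 0")
    return postings / database_size


# (term, database, postings, database size), all taken from the text
examples = [
    ("Macroeconomics", "LL", 5, 31_000),
    ("Classification Schemes", "LISA", 391, 83_450),
    ("Library Instruction", "INFO", 391, 119_400),
]

for term, database, postings, size in examples:
    wg = white_griffith_index(postings)
    new = proposed_index(postings, size)
    print(f"{term} ({database}): White & Griffith = {wg:.3f}, proposed = {new:.4f}")
```

For the two terms with 391 postings, the White and Griffith values coincide at 0.386, while the proposed values differ (about 0.0047 for LISA and 0.0033 for INFO) because the databases differ in size.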

The index also ranges between 0 and 1, and the smaller the index, the more discriminating the term. The index assumes a value of 1 if the term is used as a descriptor for all documents in the database and it assumes a value of 0 if the term is not used at all in the database. For this measure, the discrimination threshold is set at 0.05 and, like the White and Griffith value, was selected arbitrarily but is appropriate because any term with a value above 0.05 would be indexing more than 5% of the documents in any database. So far the thresholds that have been set only act as a dividing line between good and poor discriminating terms; however, we found that many terms also existed that over discriminated. Therefore, in addition, we set a limit to separate out those overly discriminating terms. For our index, good discriminating terms have values ranging between 0.001 and 0.05, with values in the lower end of the scale being better discriminators. For those wishing to use the White and Griffith index with databases of comparable size, the scale is set at 0.25 and 0.75, with values at the upper end of the scale being better discriminators.

The discrimination index proposed by the writers might be used to re-evaluate the data from the White and Griffith study for the following reasons:

1) Using their discrimination index, only the MEDLINE 1973-1979 and BIOSIS PREVIEWS 1969-1976 comparison is acceptable because they are similar in size, 1.98 million records and 1.7 million records, respectively. Their comparisons of MEDLINE with Excerpta Medica 1974-1979 (1.2 million records) and with PsycINFO 1967-present (400,000 records) are not valid because a) terms with the same index value would have the same number of postings but would actually be indexing different percentages of documents in a database, and b) PsycINFO would not likely have too many terms which would be describing more than 10,000 documents, which is needed to attain an index value lower than 0.25.

2) Terms which overly discriminate have been considered good discriminators because only the 0.25 threshold value was used to determine power of discrimination. A more appropriate measure would be one which can identify these over discriminating indexing terms. Thus, we suggest that an upper boundary be set to identify over discriminators, and omit them from the list of good discriminators.

The discrimination index suggested by the writers serves as an alternative measure when databases of different sizes are being compared using the White and Griffith methodology for determining quality of indexing. In addition, our index is a more discreet measure because terms which poorly and overly discriminate are identified, allowing only good discriminators to be considered in the evaluation.
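The 0.001 and 0.05 limits described above can be sketched as a simple classification rule. This Python example is only an illustration: the function name `classify_term` and the three label strings are hypothetical, and treating both limits as inclusive for "good" terms is an assumption, since the text does not say how values exactly at a limit should be handled. The postings and database sizes are those quoted earlier for LL, LISA, and INFO.

```python
def classify_term(postings: int, database_size: int,
                  lower: float = 0.001, upper: float = 0.05) -> str:
    """Classify a term by the proposed index (postings / database size).

    Assumption: values exactly equal to `lower` or `upper` count as good.
    """
    if database_size <= 0 or not 0 <= postings <= database_size:
        raise ValueError("need 0 <= postings <= database_size, with size > 0")
    index = postings / database_size
    if index > upper:
        # Term indexes more than 5% of the database: too broad
        return "poor discriminator"
    if index < lower:
        # Term indexes fewer than 0.1% of the database: too narrow
        return "over-discriminator"
    return "good discriminator"


# (term, database, postings, database size), all taken from the text
terms = [
    ("Macroeconomics", "LL", 5, 31_000),
    ("Classification Schemes", "LISA", 391, 83_450),
    ("Services", "LISA", 28_289, 83_450),
    ("Software", "INFO", 10_775, 119_400),
]

for term, database, postings, size in terms:
    print(f"{term} ({database}): {classify_term(postings, size)}")
```

Under these assumptions, "Macroeconomics" in LL falls below the lower limit, "Classification Schemes" in LISA falls within the good range, and "Services" in LISA and "Software" in INFO exceed the upper limit.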

REFERENCES

1. Sparck Jones, K. Retrieval system tests 1958-1978. In: Sparck Jones, K., editor. Information retrieval experiment. London: Butterworths; 1981.

2. White, H.D. and Griffith, B.C. Quality of indexing in online databases. Information Processing & Management, 23(3): 211-24; 1987.

3. Salton, G. and McGill, M.J. Introduction to modern information retrieval. New York: McGraw-Hill; 1983.

4. Chu, Clara M. and Ajiferuke, Isola. Quality of indexing in library and information science databases. London, Ont.: School of Library and Info. Science, U.W.O., 1987; 33 p. [Unpublished]