AN INVESTIGATION OF DOCUMENT STRUCTURES*

W. M. SHAW, JR.

School of Information and Library Science, University of North Carolina, Chapel Hill, NC 27599-3360, U.S.A.

(Received 3 April 1989; accepted in final form 11 July 1989)

Information Processing & Management, Vol. 26, No. 3, pp. 339-348, 1990. Printed in Great Britain. 0306-4573/90 $3.00 + .00. Copyright © 1990 Pergamon Press plc.

Abstract-The presence of clustering structure in a document collection and the influence of the presence of clustering structure on the success of cluster-based retrieval are investigated as a function of term-weight and similarity thresholds. The term-weight threshold selects a particular level of indexing exhaustivity for the document representation, and the similarity threshold selects a specific level of the associated single-link hierarchy. Results show clear evidence for clustering structure in the most exhaustive and the least exhaustive subject representations. Results also show that observed values of cluster-based retrieval effectiveness at all exhaustivity levels can be explained by assuming that the pairwise associations responsible for the structure imposed on the document collection are generated randomly. The results suggest that the structure imposed on a small document collection by an automatically produced subject representation is unrelated to the structure imposed on the documents by relevance relationships.

1. INTRODUCTION

Clustering is a useful tool in exploratory data analysis and is intended to reveal natural groupings or clusters in a set of data elements [1]. Numerous clustering strategies and algorithms can be applied to a set of elements, when each element is described by a list of features, and any of these applications can be expected to produce clusters. Whether the results manifest an inherent clustering structure within the data or merely represent an artifact of the clustering procedure cannot be reliably detected by observation, nor can it be inferred from anecdotal interpretations. The results of cluster analysis require validation, and the first step in a validation process is to confirm that the data possess a clustering structure [2]. Because the empirical significance of a clustering outcome can be defined and measured, cluster-based retrieval provides a context for investigating the implications of cluster validation [3].

Two questions will be addressed in this study. Does the presence or absence of a clustering structure in a set of documents change as a function of the exhaustivity of a subject representation? Is the presence or absence of a clustering structure related to the success or failure of cluster-based retrieval?

2. BACKGROUND

In a previous investigation, the empirical significance of document partitions was evaluated by the extent to which documents relevant to the same query were grouped together in a single-link hierarchy [4]. Isolated relevant documents were selected when no cluster of two or more documents produced superior retrieval effectiveness. In the present study, both the empirical and statistical significance of the clustering results are evaluated, and only clusters of two or more documents are selected. Both the previously mentioned study and the current study are based on a collection of 250 master’s degree papers in library science and a group of 22 queries with relevance evaluations [5]. Derived from an analysis of the abstracts and titles, index terms representing subject concepts were assigned to each paper in the Library Science (LS) Test Collection by a variant of the FASIT automatic indexing system [5,6]. A specificity weight based on the conventional inverse document frequency formulation was assigned to each term [7]. The weight of term k in document i, denoted $w_{ki}$, is directly proportional to the frequency of occurrence of term k in document i and inversely proportional to the number of documents to which the term has been assigned. A term-weight threshold, denoted TW, is used to control the exhaustivity and specificity of document representations; term k describes document i only if $w_{ki} \geq TW$. In Table 1, collection statistics are given as a function of the term-weight threshold. Because all terms are excluded from some documents when $TW \geq 60$, subsequent analyses will be limited to the range $0 \leq TW \leq 50$. The collection is small by retrieval standards but is large by standards associated with random graph analysis, as will be shown.

*Supported in part by the National Science Foundation under grant No. ET-86-03214.

Table 1. LS collection statistics

                                            Term-Weight Threshold (TW)
                                                  0     10     20     30     40     50     60
Number of documents with one or more terms      250    250    250    250    250    250    235
Minimum number of terms                           6      6      5      3      1      1      1
Maximum number of terms                          28     27     25     21     16     12      9
Average number of terms                        15.7   14.8   13.0   10.2    7.1    5.0    3.2

In this study, the similarity of documents i and j, denoted $S_{ij}$, is based on Dice’s Coefficient [3]. A similarity threshold, denoted TS, is used to specify the degree of correspondence between document representations required to establish a pairwise relationship; document i is related to document j if $S_{ij} \geq TS$. A document graph is composed of the document set, referred to as the point set, and unordered pairs of distinct, related documents, referred to as lines. For any fixed value of TW, the number of document pairs for which $S_{ij} \geq TS$ defines the number of lines in the document graph, and the number of lines is denoted by q [8]. Combinations of term-weight and similarity threshold values will cause the documents to be grouped into disjoint clusters or components. The result is a disconnected graph in which the points are distributed among the components. In this case, a partition is produced such that

$$\sum_{i=1}^{s} p_i = p, \tag{1}$$

where s is the total number of components in the graph, p is the total number of points in the graph, and $p_i$ is the number of points in, referred to as the order of, the ith component. Components of order one ($p_i = 1$) are referred to as isolated points, and a graph composed of one component (s = 1) is said to be connected. The set of distinct partitions across all the values of the similarity threshold (TS), for a fixed value of the term-weight threshold (TW), constitutes a single-link hierarchy which can vary from a single component at low values of TS to a group of isolated points at high values of TS [8].
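The construction described in this section can be illustrated with a minimal sketch. It assumes that each document is represented by the set of terms surviving the term-weight threshold and that Dice's Coefficient is computed on those binary term sets; the toy documents, the function names `dice` and `components`, and the union-find bookkeeping are illustrative assumptions, not the procedure actually used in the study.

```python
from itertools import combinations


def dice(a: set, b: set) -> float:
    """Dice's Coefficient between two term sets (binary representation assumed)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def components(docs: dict, ts: float) -> list:
    """Partition documents into the components of the graph with lines S_ij >= TS."""
    parent = {d: d for d in docs}

    def find(x):
        # Follow parent links to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(docs, 2):
        if dice(docs[i], docs[j]) >= ts:
            parent[find(i)] = find(j)  # a line joins two components

    groups = {}
    for d in docs:
        groups.setdefault(find(d), set()).add(d)
    # Largest components first; ties broken by the smallest document label.
    return sorted(groups.values(), key=lambda g: (-len(g), min(g)))


# Hypothetical toy collection: document label -> terms kept at some fixed TW.
docs = {
    "d1": {"cataloging", "automation", "library"},
    "d2": {"cataloging", "automation"},
    "d3": {"indexing", "retrieval"},
    "d4": {"indexing", "retrieval", "evaluation"},
    "d5": {"archives"},
}

# Sweeping TS for a fixed TW walks through levels of the single-link hierarchy.
for ts in (0.0, 0.5, 0.9):
    partition = components(docs, ts)
    orders = [len(c) for c in partition]
    print(ts, len(partition), orders, sum(orders) == len(docs))
```

At every threshold the component orders sum to the number of documents, which is the content of eqn (1); low thresholds give a connected graph and high thresholds give isolated points.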

3. EMPIRICAL SIGNIFICANCE

In the present study, the empirical significance of partitions, produced by combinations of term-weight and similarity thresholds, is evaluated by the extent to which documents relevant to the same query appear in the same cluster, a component of order two or greater, and documents relevant to different queries appear in different clusters. The empirical significance of the hierarchy is also evaluated. Evaluations are based on an effectiveness measure commonly employed in cluster-based retrieval and given by

$$E = 1 - \frac{2}{1/P + 1/R}, \tag{2}$$


where P and R are the standard precision and recall measures. Equation 2 can also be written as the normalized symmetric difference between the set of documents in the retrieved cluster and the set of documents relevant to a query. In that form, the equation reduces to the complement of Dice’s Coefficient. The magnitude of E varies from zero to one and is inversely related to retrieval performance. When E = 0, all documents relevant to a query appear in the retrieved cluster and no other document appears in the cluster, and when E = 1, no relevant document appears in the retrieved cluster [3].

Consider the document graph and associated partition of the document set defined by a combination of term-weight and similarity thresholds. In this study, every cluster composed of two or more documents is evaluated as a response to a query, and the “best” cluster is selected. That is, the effectiveness measure E is computed for all clusters and the smallest value identifies the most favorable response to the associated query. The lowest value of E for each query is computed, and the average of these values, denoted by EP, is used to characterize the empirical significance of the partition. In the context of cluster-based retrieval, the magnitude of EP represents the optimal retrieval effectiveness of a strategy that is constrained to retrieve one cluster for each query from one level of a single-link hierarchy. The empirical significance of the hierarchy, denoted by EH, is computed by averaging the lowest E values produced by selecting clusters from any level of the hierarchy. The magnitude of EH represents the optimal retrieval effectiveness of a cluster-based strategy that is constrained to retrieve one cluster for each query from any level of the associated hierarchy.
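The measures defined in this section can be sketched in a few lines. The following is a minimal illustration, assuming clusters and relevance judgments are held as Python sets; the function names and the toy partitions and queries are hypothetical and are not taken from the study. It also checks numerically that eqn (2) agrees with the normalized symmetric difference form.

```python
def e_measure(cluster: set, relevant: set) -> float:
    """E = 1 - 2/(1/P + 1/R); taken as 1.0 when no relevant document is retrieved."""
    hits = len(cluster & relevant)
    if hits == 0:
        return 1.0
    precision = hits / len(cluster)
    recall = hits / len(relevant)
    return 1 - 2 / (1 / precision + 1 / recall)


def e_symmetric_difference(cluster: set, relevant: set) -> float:
    """The same quantity written as a normalized symmetric difference."""
    return len(cluster ^ relevant) / (len(cluster) + len(relevant))


def partition_effectiveness(partition, queries) -> float:
    """EP: average over queries of the lowest E among clusters of order >= 2."""
    clusters = [c for c in partition if len(c) >= 2]
    best = [
        min((e_measure(c, relevant) for c in clusters), default=1.0)
        for relevant in queries.values()
    ]
    return sum(best) / len(best)


def hierarchy_effectiveness(partitions, queries) -> float:
    """EH: as EP, but the best cluster may come from any level of the hierarchy."""
    pooled = [c for partition in partitions for c in partition]
    return partition_effectiveness(pooled, queries)


# Hypothetical two-level hierarchy over six documents and two queries.
level_high_ts = [{1, 2, 3}, {4, 5}, {6}]
level_low_ts = [{1, 2, 3, 4, 5}, {6}]
queries = {"q1": {2, 3, 4, 5}, "q2": {4, 5}}

ep_high = partition_effectiveness(level_high_ts, queries)
ep_low = partition_effectiveness(level_low_ts, queries)
eh = hierarchy_effectiveness([level_high_ts, level_low_ts], queries)
print(ep_high, ep_low, eh)
```

Because EH may draw each query's cluster from any level, it can never exceed the smallest EP found at a single level, which is the relationship between EH and EPM examined later in the paper.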

4. STATISTICAL “SIGNIFICANCE”

In any labeled graph with p points and q lines, referred to as a labeled (p,q) graph, the total number of possible lines, denoted by Q, is given by the binomial coefficient and reduces to $Q = p(p - 1)/2$. Let $\{G_{p,q}\}$ denote the set of all possible labeled graphs with p points and q lines. The number of elements in this set, denoted by $G_{p,q}$, is given by the number of ways q lines can be selected without replacement from the Q possible lines, and is also derived from the binomial coefficient, that is, $G_{p,q} = \binom{Q}{q}$. A random (p,q) graph can be constructed by selecting q lines at random from the Q possible lines so that all elements in $\{G_{p,q}\}$ have the same probability of occurring, given by $1/G_{p,q}$ [8,9].

It is assumed under the Random Graph Hypothesis (RGH) that the lines of a (threshold) graph are randomly selected from the set of all possible lines [1,2]. This assumption constitutes a null hypothesis which defines a set of points and lines for which there is no “clustering structure.” Given the RGH, analytical formulations [10] or Monte Carlo simulations [11,12] can be used to compute expected properties of a random graph. For example, the exact distribution for $q_m$, the minimum number of lines at which a random graph with p points becomes connected, has been obtained analytically for $p \leq 100$ [13,14]. The results are available in a table which gives the probability $P(q_m \leq q \mid p)$ as a function of q for fixed values of p [14]. These results have been used to assess the presence or absence of clustering structure in chemical classifications [15]. The expected number of components in a random graph has been computed from exact distributions and from asymptotic solutions to combinatorial relationships for values of p in the range $10 \leq p \leq 100$ and values of q in the range $10 \leq q \leq 220$, and these results are also available in tabular form [14,16]. In either case, observed values of the statistic can be compared to expected values and “significant,” observable differences can be taken to indicate the presence of a clustering structure in the data. Because of limitations in the theory, including no definition of a “real” cluster, “significance” is generally evaluated by observation rather than by formal hypothesis testing [14].

For clustering exercises involving more than 100 points, it is necessary to rely on computer simulations or computations based on recursive formulations, which can also be formidable. In the present application, a Monte Carlo simulation implemented by a program called RANEFF and executed on a microcomputer is used to generate the desired statistics. The program constructs a specified number of random graphs with specified numbers of points and lines and computes a variety of summary statistics. Component size estimates produced by RANEFF have been validated by comparisons to published values cited previously. In the present application, 225 random graphs were generated, the number of components and the retrieval effectiveness for the associated partition were determined for each random graph, and the expected values of these statistics were computed from the resulting distributions. Each random graph consisted of 250 points and a variable number of lines determined by the threshold values TW and TS. The number of lines ranges from 2 to 3,147. The difference between the observed and the corresponding expected number of components, denoted by DS, is used to assess the presence or absence of a clustering structure. The expected value of E, denoted by EX, is compared to the corresponding observed value (EP) to determine if the observed results can be explained on the basis of a random structure.

5. RESULTS

5.1 Clustering structure: DS(TW, TS)

Differences between the observed and expected number of components have been computed for levels of the single-link hierarchy associated with six exhaustivity levels in the range $0 \leq TW \leq 50$. The results for DS(TW, TS) are given in Fig. 1. The obvious differences between observed and expected numbers of components for each TW value imply that a clustering structure is present for all levels of exhaustivity. In the absence of a theoretical interpretation of “clustering structure,” no more definitive statement is appropriate. However, there is a systematic pattern in these results that must be considered. For each value of TW, the relationship between DS and TS is unimodal. “Small” values of TS allow so many lines to be present that only one clustering result is possible when the single-link clustering criterion is employed: a single inclusive component for which DS = 1 - 1 = 0. As TS is increased, the trend line for DS increases to a maximum value and then decreases to zero for “large” values of TS. “Large” values of TS eliminate all lines from the graph and DS = 250 - 250 = 0, for this document collection. When q = 1, DS must also equal zero because only one partition is possible. It can also be seen that the range of values for which DS > 0 and the maximum values of DS decrease as the term-weight threshold is increased from TW = 0 to TW = 30 or TW = 40, but the trend appears to reverse for values of TW beyond 40. Although it can be concluded that there is evidence of a clustering structure for all exhaustivity levels, it can be seen that there is less evidence of clustering structure for TW = 30 and TW = 40 than for any other exhaustivity level.

Fig. 1. Difference between the observed and expected number of components (DS) as a function of term-weight (TW) and similarity (TS) thresholds.

The pattern observed in these results can be interpreted from several perspectives. First, it could be assumed that the clustering tendency reflects the structure imposed on the documents by the queries and relevance evaluations. In this case, the best retrieval results would be expected primarily at the exhaustive representation TW = 0, and perhaps secondarily, at the specific representation TW = 50. As shown below, however, the converse is true. The partition that most successfully satisfies the empirical criterion and yields the optimal value of EP occurs at TW = 30. The assumption cannot be entirely discounted, however, because single-link clustering might fail to reveal the inherent clustering structure in the data. Some improvement in cluster-based retrieval performance has been reported when alternatives to the single-link clustering criterion were employed, and these results were derived from test collections with exhaustive subject representations [17,18]. It is unknown how these alternative clustering methods would perform at different levels of exhaustivity.

Second, because the meaning of “clustering structure” is not defined, the presence of “clustering structure” cannot be taken as evidence that any particular empirical interpretation is confirmed.
For example, the presence of clustering structure in the LS Test Collection does not ensure that documents relevant to the same query appear in the same cluster and documents relevant to different queries appear in different clusters. Moreover, the systematic pattern shown in Fig. 1 does not necessarily reflect degrees of correspondence to a single empirical interpretation. The clustering results might be more “chaotic” than uniform. That is, the meaning of clustering results might vary as a function of TW. For some values of TW, the results might represent a successful subject classification; for other values, the results might capture aspects of the structure imposed on the documents by a relevance relationship. This interpretation suggests that the “relevance structure” appears only in the vicinity of TW = 30 and that single-link clustering is capable of detecting the empirically meaningful results.

There is a third possibility. The clustering structure manifest in these data might be unrelated to the structure associated with relevance evaluations. In this case, no clustering technique can be expected to produce results that are superior to those that would be expected under the RGH. All clustering techniques might be expected to produce equally poor results, although the comparable results might occur at different TW values for different clustering techniques. If this interpretation is correct, a statistical explanation for the relative retrieval performance at the various exhaustivity levels should be available. Results presented below support this interpretation and provide a tentative explanation for an entire class of cluster-based retrieval results.

5.2 Observed effectiveness: EP(TW, TS) and EH(TW)

Observed values of the effectiveness measures have been computed for levels of the single-link hierarchies associated with six exhaustivity thresholds in the range $0 \leq TW \leq 50$. The effectiveness of each hierarchy has also been computed. In this study, a normative measure suggested by Sparck-Jones is used to evaluate differences between effectiveness values; only differences of ten percent or more are considered to be “material” [19]. The results for EP(TW, TS) and EH(TW) are given in Fig. 2. The data points give the values of EP as a function of TS for fixed values of TW. EH(TW) is shown as a straight line below the hierarchy with which it is associated. The length of the line includes all levels of the hierarchy from which a cluster was selected as a response to a query.

Fig. 2. Cluster-based effectiveness values EP(TW, TS), EX(TW, TS), and EH(TW), shown respectively as data points, solid curve, and solid straight line.

These results reveal several patterns of interest in this study. First, the relationship represented by EP(TW = constant, TS) shows a similar pattern for all values of TW. For each value of TW, there is a low performance plateau for low values of TS. The lowest performance level for small values of TS occurs when all documents are connected and the entire collection is retrieved for each query. As TS is increased, the magnitude of EP decreases rapidly to a minimum value. The minimum value of EP, denoted by EPM, identifies the level of the associated single-link hierarchy which maximizes retrieval performance of a cluster-based strategy constrained to operate at one level of the hierarchy. As TS is increased further, a second low performance plateau occurs when the partition is composed of a set of isolated points for which EP = 1. Thus, for each level of indexing exhaustivity, a level of the hierarchy is identified which most successfully satisfies the empirical criterion.

The second pattern to be emphasized is shown by EPM as a function of TW. As shown in Fig. 2, EPM decreases as TW is increased from TW = 0 to TW = 30. The trend reverses for TW > 30 and reveals an optimal partition, denoted by EPO, within the hierarchy associated with TW = 30. The optimal partition is defined by the level of exhaustivity and the level of the associated hierarchy for which documents relevant to the same query are most successfully connected and documents relevant to different queries are most successfully separated.

Finally, the pattern revealed by EH(TW) is of interest. Although the magnitude of EH for TW = 10 is less than any other value, there is no material difference between any two EH values. The complex hierarchy for the exhaustive representation is of no greater utility than the hierarchy associated with the specific representation at TW = 50. For the LS Test Collection and the single-link criterion, a cluster-based retrieval strategy constrained to retrieve one cluster for each query, but free to select the cluster from any level of a hierarchy, produces comparable results at all exhaustivity levels.
Because EH is essentially constant as a function of TW and EPM varies systematically as a function of TW, there is a surprising relationship between EH and EPM. For TW = 30, and only for TW = 30, there is no material difference between EPM and EH. That is, for TW = 30, cluster-based retrieval effectiveness derived from a partition of the document collection associated with one level of the single-link hierarchy produces a performance level that is comparable to the result derived from the entire hierarchy.

5.3 Expected effectiveness: EX(TW, TS)

The six curves in Fig. 2 represent EX(TW = constant, TS). The pattern revealed by the expected values of effectiveness is clearly similar to that of the observed values. Indeed, the expected values, the values based on a random structure, are superior to the observed values in the vicinity of EPM for the exhaustive representation, TW = 0. There is no material difference between corresponding values of EX and EP for TW = 10. For $TW \geq 20$, there is no material difference between observed and expected effectiveness values, except in the vicinity of EP = EPM. For this test collection, the structure imposed on the documents by a subject representation is only marginally superior to a random structure with respect to recovering the structure anticipated by the relevance evaluations.
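The expected values used throughout Section 5 come from a Monte Carlo simulation under the RGH, as described in Section 4. The following is a minimal sketch of such a simulation for the number of components only; it is not the RANEFF program, whose internals are not described here, and the function names, the fixed seed, and the example values of q are assumptions made for illustration. The default of 225 trials mirrors the number of random graphs reported in Section 4.

```python
import random
from itertools import combinations


def count_components(p: int, lines) -> int:
    """Number of components in a graph on points 0..p-1 with the given lines."""
    parent = list(range(p))

    def find(x):
        # Follow parent links to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in lines:
        parent[find(i)] = find(j)
    return len({find(v) for v in range(p)})


def expected_components(p: int, q: int, trials: int = 225, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected number of components of a random (p,q) graph."""
    rng = random.Random(seed)
    all_lines = list(combinations(range(p), 2))  # the Q = p(p-1)/2 possible lines
    total = 0
    for _ in range(trials):
        # Selecting q lines without replacement makes every (p,q) graph equally likely.
        total += count_components(p, rng.sample(all_lines, q))
    return total / trials


def ds(observed: int, p: int, q: int, trials: int = 225, seed: int = 0) -> float:
    """DS: observed minus expected number of components at the same p and q."""
    return observed - expected_components(p, q, trials, seed)


# Hypothetical usage: 250 points, as in the LS collection, at a few line counts.
for q in (2, 100, 500):
    print(q, expected_components(250, q))
```

An estimate of EX could be obtained in the same loop by scoring each random partition against the relevance judgments, in place of counting its components.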

6. DISCUSSION

Results associated with the LS Test Collection and presented here reveal systematic patterns in the relationships EP(TW, TS), EPM(TW), and EH(TW). These patterns have been recently confirmed in a larger and more diversified test collection [20,21]. The CF Test Collection consists of 1,239 papers published in the years 1974 through 1979 and indexed with the term CYSTIC FIBROSIS in the National Library of Medicine’s MEDLINE file, and 100 queries with three sets of exhaustive relevance evaluations from subject experts. The test collection includes four subject representations and two citation representations:

1. the MEDLINE terms representing the major subjects of the paper, (a) without subject subheadings and (b) with subject subheadings;
2. the complete set of MEDLINE index terms representing the major and minor subjects of the paper, (a) without subject subheadings and (b) with subject subheadings;
3. the complete set of references appearing in the paper; and
4. a comprehensive set of citations to the paper, as indexed in the Science Citation Index (SCI)/DIALOG files.

Patterns in the relationships EP(TW, TS), EPM(TW), and EH(TW) for the four subject representations and two citation representations are essentially the same as those shown in Fig. 2, and as shown in Table 2, the magnitude of the results for EPO and the magnitude of the results for EHO, the optimal EH value, are similar to those associated with the LS Test Collection. Material differences between EHO values occur in the CF Test Collection when the highest performance level, associated with the reference and citation representations, and the second highest performance level, associated with the representation consisting of all subjects with subject subheadings, are compared to the lowest performance level, associated with the major-subject representation. There is no material difference between any EPO values, including those associated with FASIT subject concepts. Results presented in Table 2 suggest that alternative document representations produce similar levels of retrieval performance when they are evaluated in an equivalent manner [22].

Table 2. LS and CF test collection results

                                              Effectiveness values
Test collection/representation                  EHO     EPO
LS test collection/
  FASIT subject concepts                        0.59    0.66
CF test collection/
  MEDLINE: Major subjects                       0.64    0.69
  MEDLINE: Major subjects & subheadings         0.59    0.65
  MEDLINE: All subjects                         0.60    0.69
  MEDLINE: All subjects & subheadings           0.57    0.68
  References                                    0.56    0.67
  Citations                                     0.56    0.65

Implications associated with the RGH and shown in the relationship EX(TW, TS) provide a tentative explanation for a class of cluster-based retrieval results. Based on an investigation of several clustering strategies and several test collections, it has been concluded that the best retrieval results are generated by clustering strategies that produce many small clusters, especially clusters consisting of only two documents [17]. Whether the clustering strategies reveal a clustering tendency in the data associated with relevance or merely produce results that are an artifact of the technique is unclear. It is clear, however, that a list of randomly selected document pairs will include many more relevant-nonrelevant associations than relevant-relevant associations, with respect to any given query. Under these circumstances, no clustering technique can be expected to connect documents relevant to the same query. Clusters based on pairwise associations that are unrelated to the relevance criterion will include few documents that are relevant to a particular query, when the cluster is small compared to the size of the collection.
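The imbalance between relevant-nonrelevant and relevant-relevant pairs noted above can be made concrete by counting pairs. The sketch below uses the LS collection size of 250 documents; the figure of 10 relevant documents for a query is a hypothetical value chosen only for illustration and is not taken from the test collection.

```python
from math import comb


def pair_counts(n: int, r: int):
    """Count document pairs by relevance status for one query.

    n is the collection size and r the number of documents relevant to the query.
    Returns (relevant-relevant, relevant-nonrelevant, nonrelevant-nonrelevant).
    """
    rr = comb(r, 2)
    rn = r * (n - r)
    nn = comb(n - r, 2)
    return rr, rn, nn


n = 250  # documents in the LS Test Collection
r = 10   # hypothetical number of documents relevant to one query

rr, rn, nn = pair_counts(n, r)
print(rr, rn, nn)          # 45 2400 28680
print(rn / rr)             # relevant-nonrelevant pairs per relevant-relevant pair
print(rr / comb(n, 2))     # chance that a randomly selected pair is relevant-relevant
```

Under these assumed numbers a randomly selected line that touches a relevant document is about fifty times more likely to attach it to a nonrelevant document than to another relevant one.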
If such a cluster is retrieved, recall will be low because few relevant documents are retrieved, and precision will be low because many nonrelevant documents are retrieved. Retrieval performance can be artificially improved by selecting the smallest allowable cluster that contains at least one relevant document, because precision is improved with little sacrifice in recall. Recall can be increased by retrieving many small clusters, but if the clusters consist of two documents, only one of which is relevant, the cluster criterion can be abandoned.

In the present investigation of FASIT subject concepts, the random assumption provides a good explanation for the observed results, and the effect of the number and size of clusters can be tested. Figure 3 shows the number of clusters with two or more documents as a function of TS for fixed values of TW. For each level of exhaustivity, the distribution is unimodal and the maximum number of clusters occurs at the TS value for which EP = EPM. That is, for each level of exhaustivity, the partition for which cluster-based retrieval performance is maximized occurs when the number of clusters is maximized and the average number of documents per cluster is minimized. Moreover, it can be seen that the maximum number of clusters increases as TW is increased from TW = 0 to TW = 30, where EPM = EPO, and decreases thereafter. The curves shown in Fig. 3 are essentially the mirror images of the corresponding curves for EP(TW, TS) shown in Fig. 2; the similarity scales in Figs. 2 and 3 are reversed to minimize overlapping data points. The relative retrieval performance can be predicted directly from the results for the number of clusters in partitions of the document collection when there is no clustering structure in the data or when the clustering structure is unrelated to the relevance structure.

Fig. 3. Number of clusters as a function of term-weight (TW) and similarity (TS) thresholds.

If the results associated with cluster-based retrieval can be explained by assuming that the underlying structures are the result of a random process, the problem cannot be attributed to the clustering criterion or the retrieval strategy, nor can it necessarily be attributed to FASIT in particular or automatic indexing in general. The fundamental problem might be associated with the assumption that the complex notion of relevance is embodied exclusively in subject representations of documents [23]. The appropriate subject might be necessary to ensure a successful retrieval outcome, but not sufficient.

7. CONCLUSION

The presence of clustering structure in the LS Test Collection has been investigated as a function of the exhaustivity of the subject representation. Results show that the evidence for clustering structure varies systematically as a function of a term-weight threshold, which controls the exhaustivity of the representation. There is clear evidence for the presence of clustering structure at the most exhaustive and least exhaustive representations. Intermediate values of the term-weight threshold produce the least evidence of clustering structure in the document collection. The relationship between the success of cluster-based retrieval and the presence of clustering structure has also been investigated. Results show that exhaustivity levels with the least evidence of clustering structure produce the most successful retrieval results. This unexpected outcome can be explained by assuming that the clustering structure manifest in the data is unrelated to the structure imposed on the documents by a set of queries and relevance evaluations. It is shown that observed levels of cluster-based retrieval performance can be explained by assuming that pairwise associations between documents are selected randomly. As expected under the random graph hypothesis, it is shown that levels of retrieval performance are inversely related to the size of clusters in the clustering outcome and that relative retrieval performance can be predicted by the number and average size of clusters. Patterns in observed performance levels computed from the LS Test Collection have been confirmed in the CF Test Collection, which includes four subject representations based on MEDLINE index terms and two citation representations. In general, cluster-based retrieval results characterized by low performance levels and a reliance on small clusters might signify that there is no clustering structure in the data or that the clustering structure is unrelated to the relevance criterion [17,24-27].
If the presence of random structure is confirmed in the CF Test Collection and in other test collections, the capacity of subject representations alone to associate documents relevant to the same query and discriminate between documents relevant to different queries must be questioned.

REFERENCES

1. Dubes, R.; Jain, A.K. Clustering methodologies in exploratory data analysis. In: Yovits, M.C., editor. Advances in Computers, 19: 113-228; 1980.
2. Dubes, R.; Jain, A.K. Validity studies in clustering methodologies. Pattern Recognition, 11: 235-254; 1979.
3. Van Rijsbergen, C.J. Information retrieval. London: Butterworths; 1979.
4. Shaw, W.M., Jr. An investigation of document partitions. Information Processing and Management, 22(1): 19-28; 1986.
5. Dillon, M.; Gray, A.S. FASIT: A fully automatic syntactically based indexing system. Journal of the American Society for Information Science, 34(2): 99-108; 1983.
6. Dillon, M.; McDonald, L.K. Fully automatic book indexing. Journal of Documentation, 39(3): 135-154; 1983.
7. Salton, G.; McGill, M.J. Introduction to modern information retrieval. New York: McGraw-Hill; 1983.
8. Harary, F. Graph theory. Reading, MA: Addison-Wesley; 1969.
9. Harary, F.; Palmer, E.M. Graphical enumeration. New York: Academic Press; 1973.
10. Karonski, M. A review of random graphs. Journal of Graph Theory, 6: 349-389; 1982.
11. Schultz, J.V.; Hubert, L.J. Data analysis and the connectivity of random graphs. Journal of Mathematical Psychology, 10: 421-428; 1973.
12. Ogilvie, J.C. The distribution of number and size of connected components in random graphs of medium size. In: Morrell, A.J.H., editor. Proceedings of the International Federation for Information Processing, 68: 1527-1530; 1969.
13. Ling, R.F. An exact probability distribution on the connectivity of random graphs. Journal of Mathematical Psychology, 12: 90-98; 1975.
14. Ling, R.F.; Killough, G.G. Probability tables for cluster analysis based on a theory of random graphs. Journal of the American Statistical Association, 71(354): 293-300; 1976.
15. Willett, P. Clustering tendency in chemical classifications. Journal of Chemical Information and Computer Sciences, 25: 78-80; 1985.
16. Ling, R.F. The expected number of components in random linear graphs. The Annals of Probability, 1(5): 876-881; 1973.
17. Griffiths, A.; Luckhurst, H.C.; Willett, P. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37(1): 3-11; 1986.
18. Voorhees, E.M. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. PhD thesis, Cornell University; 1986.
19. Sparck-Jones, K. Progress in documentation: Automatic indexing. Journal of Documentation, 39(4): 393-432; 1971.
20. Wood, J.B.; Wood, R.E.; Shaw, W.M., Jr. The cystic fibrosis data base. Technical Report No. 8902, University of North Carolina School of Information and Library Science, Chapel Hill, NC; 1989.
21. Shaw, W.M., Jr. An evaluation and comparison of subject indexing and citation indexing. Technical Report No. 8903, University of North Carolina School of Information and Library Science, Chapel Hill, NC; 1989.
22. Katzer, J.; McGill, M.J.; Tessier, J.A.; Frakes, W.; DasGupta, P. A study of the overlap among document representations. Information Technology: Research and Development, 2: 261-274; 1982.
23. Swanson, D.R. Historical note: Information retrieval and the future of an illusion. Journal of the American Society for Information Science, 39(2): 92-98; 1988.
24. Van Rijsbergen, C.J. Further experiments with hierarchic clustering in document retrieval. Information Storage and Retrieval, 10: 1-14; 1974.
25. Van Rijsbergen, C.J.; Croft, W.B. Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Information Processing and Management, 11: 171-182; 1975.
26. Croft, W.B. A model of cluster searching based on classification. Information Systems, 5: 189-195; 1980.
27. Willett, P. A note on the use of nearest neighbors for implementing single linkage document classifications. Journal of the American Society for Information Science, 35(3): 149-152; 1984.