
Information Processing & Management, Vol. 32, No. 2, pp. 127-138, 1996. Copyright © 1996 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0306-4573/96 $15.00 + 0.00

AUTOMATIC TEXT DECOMPOSITION AND STRUCTURING

GERARD SALTON, JAMES ALLAN and AMIT SINGHAL

Department of Computer Science, Cornell University, Ithaca, NY 14853-7501, U.S.A.

Abstract: Sophisticated text similarity measurements are used to determine relationships between natural-language texts and text excerpts. The resulting linked hypertext maps can be decomposed into text segments and text themes, and these decompositions are usable to identify different text types and text structures, leading to improved text access and utilization. Examples of text decomposition are given for expository and non-expository texts.

1. AUTOMATIC TEXT COMPARISON METHODS

The vector processing model of retrieval has been used with substantial success to manipulate large collections of natural-language text. In vector processing, texts or text excerpts, as well as requests for information, are represented by sets of terms, or term vectors. Collectively the terms assigned to a given text are used to represent text content. Substantially identical methods are usable for determining collection structure (by comparing pairs of text vectors with each other and identifying text pairs found to be sufficiently similar), and for retrieving information (by comparing query vectors with the vectors representing the stored items and retrieving items found to be similar to the queries). The results of a similarity computation between a query vector and the stored document vectors can be ranked in decreasing order of the computed query similarity. This makes it possible to retrieve the most important items, those most similar to the user queries, first.

In vector processing, term weights are easily accommodated because vectors of weighted terms are manipulated almost as easily as binary term vectors where weights are restricted to one for assigned terms and zero for missing terms (Salton, 1971, 1991). A high-performance term weighting system assigns large weights to terms that occur frequently in particular documents, but rarely on the outside, because such terms are able to distinguish the items in which they occur from the remainder of the collection. A typical term weight of this type, known as a tf×idf weight (term frequency times inverse collection frequency), is shown in expression (1):

$$ w_{ik} = \frac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{j=1}^{t}\bigl(tf_{ij}\,\log(N/n_j)\bigr)^{2}}} \qquad (1) $$

where w_ik represents the weight of term T_k assigned to document D_i, tf_ik is the frequency of occurrence of term T_k in D_i (a frequency of 0 is assumed for terms not assigned to D_i), N is the size of the document collection, and n_k is the number of texts in the collection with term T_k. The summation in the denominator of expression (1), taken over all terms in a particular vector, is used for length normalization to insure that all documents have an equal chance of being retrieved. Without length normalization, the longer documents with more assigned terms and higher term frequencies would generate higher document similarities, and exhibit higher retrieval potential, than the shorter items. Other forms of tf×idf weights are known, including some with different length normalizations, different term frequency factors based on the logarithm of the term frequency, and alternative inverse collection frequency factors (Salton & Buckley, 1988).

Given query and document vectors Q_j = (w_j1, w_j2, ..., w_jt) and D_i = (w_i1, w_i2, ..., w_it), respectively, or alternatively given two document vectors D_i and D_j, a vector similarity function of the form

$$ Sim(Q_j, D_i) = \sum_{k=1}^{t} w_{jk}\, w_{ik} \qquad (2) $$

will reflect the similarity in term assignments for the corresponding vectors. When normalized term weights are used, such as those of expression (1), the vector similarity lies between 0 and 1, and it depends on the proportion and the weight of matching terms in the vectors.
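A minimal Python sketch of expressions (1) and (2) may make the weighting scheme concrete; the function names and the data layout (dictionaries mapping terms to weights) are illustrative choices rather than part of the Smart system.

```python
import math
from collections import Counter

def tf_idf_vector(doc_terms, doc_freq, n_docs):
    """Length-normalized tf x idf weights for one document, as in expression (1).

    doc_terms: list of term tokens occurring in the document
    doc_freq:  dict mapping each term to n_k, the number of collection documents containing it
    n_docs:    N, the total number of documents in the collection
    """
    tf = Counter(doc_terms)
    raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items() if doc_freq.get(t)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else {}

def cosine_similarity(vec_a, vec_b):
    """Inner product of two normalized vectors, as in expression (2); lies between 0 and 1."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    return sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
```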



In many automatic text retrieval systems, such as the Smart system, the terms appearing in particular vectors are words, or expressions, extracted from the corresponding document or query texts. In these circumstances, the global vector similarity measurements of expression (2) may produce misleading results when language ambiguities are not properly recognized. For example, when words with multiple meanings are included in the vectors, term matches may be produced even though the terms may carry distinct meanings in the respective vectors. This effect exaggerates the size of the corresponding similarity measurements. Analogously, term meanings may be synonymous although the particular word forms included in the vectors are not identical. In such a case, a valid term match will not be obtained, and the similarity measure is then understated. This suggests that the global term similarity computation of expression (2) must be supplemented by additional operations designed to check the local contexts in which the terms are used in the texts under consideration. When these contexts are recognized as identical, the global vector similarity measures are accepted as reasonable indications of text relationship. Otherwise they are not.

In the Smart environment, a dual search strategy is usable in which the global vector matching operations are supplemented by local vector similarity operations. Operationally, this means that an attempt is made to detect locally matching text fragments, such as text sentences or paragraphs, included in the documents under consideration. The local text match is then used as a filter affecting the results of the global text comparison operation. That is, a sufficiently high global text similarity between a query and a document, or between two documents, is accepted as correct only if the respective texts also contain sufficiently similar local text structures, e.g. at least one matching sentence pair with sufficiently high sentence-pair similarity. The assumption is that when the local contexts coincide, so do the word meanings, and the global similarity is then accepted as indicative of a true text similarity. Otherwise, the global text similarity measurement is rejected. Local text similarities are derived by vector matching methods similar to those described earlier for full texts: a term vector is constructed for each local text excerpt, that is, each sentence and paragraph in the documents of interest, and the vector comparison formulas of expressions (1) and (2) are now applied to local text fragments rather than full documents (Salton & Buckley, 1991a, b).

In earlier experiments, conducted with an electronic version of the Funk and Wagnalls New Encyclopedia (25,000 encyclopedia articles, about 60 megabytes of text), it was found that the dual global-local text operation was effective in detecting most of the obvious retrieval errors that were due to polysemies and related language ambiguities. For example, using only global text matching methods, it is impossible to reject documents about Anthony M. Kennedy (the Supreme Court Justice) in answer to questions about John F. Kennedy (the former U.S. President), because coincidences in the background of the two personalities (the identical name, their education at Harvard University, their high status in the U.S. government, etc.) produce high global vector similarities. However, the specific local environment of the two people is very different, and no matching local structures are detected. Hence, the items about Anthony M. Kennedy can safely be rejected when information is wanted about John F. Kennedy. An evaluation based on 359 searches carried out in the Funk and Wagnalls Encyclopedia indicates that the best context check using local paragraph matches provides an advantage of 12.5% in average retrieval precision over the global vector match alone, while the best local sentence comparisons furnish improvements of 13.4% (Salton et al., 1994a).
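The dual global-local strategy can be sketched as a simple filtering loop. This is an illustrative reading of the approach, not the actual Smart implementation, and the document objects carrying .vector and .sentence_vectors attributes are an assumption made for brevity.

```python
def passes_local_filter(query_sentences, doc_sentences, sim, sentence_threshold):
    """Accept a document only if at least one query/document sentence pair is similar enough."""
    return any(sim(q_sent, d_sent) >= sentence_threshold
               for q_sent in query_sentences
               for d_sent in doc_sentences)

def dual_search(query_vec, query_sentences, documents, sim, global_threshold, sentence_threshold):
    """Global vector match followed by the local sentence-level context check."""
    accepted = []
    for doc in documents:                      # each doc carries .vector and .sentence_vectors
        global_sim = sim(query_vec, doc.vector)
        if global_sim < global_threshold:
            continue                           # fails the global comparison
        if passes_local_filter(query_sentences, doc.sentence_vectors, sim, sentence_threshold):
            accepted.append((global_sim, doc)) # local contexts coincide: keep the item
    return sorted(accepted, key=lambda pair: pair[0], reverse=True)
```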


Table 1. Evaluation of global-local text comparisons
Federal Register collection: 46,164 documents; 74 queries. Term weights for both query and document terms use the tf×idf weights specified by expression (1). Restricted searches require at least one matching sentence with at least 3 matching terms and a sentence similarity above the stated threshold.

                                           Unrestricted         Restricted search runs
                                           search run           (sentence similarity threshold)
                                           (no required
                                           sentence match)       30         35         40         50
  Total No. of retrieved items             14,800                7972       6927       5872       3979
  Total No. of relevant retrieved items    516                   547        535        501        381
  11-point search precision                0.1364                0.1814     0.1834     0.1819     0.1760
                                                                (+33.0%)   (+34.5%)   (+33.4%)   (+29.0%)

Table 1 shows retrieval results based on 74 searches conducted with a large collection of over 46,000 documents taken from the Federal Register (over 400 megabytes of text). The 74 searches were used earlier as part of the TREC experiments (Harman, 1995). The results for the unrestricted global vector match, using the term weighting system of expression (1) for both query and document terms (known as ntc-ntc weighting), are given in column 1 of Table 1, while the remaining columns of Table 1 contain retrieval results for various restricted search runs requiring at least one matching sentence pair between the query and each retrieved item, with at least 3 matching terms in the matching sentences, and a sentence similarity threshold varying between 30.0 and 50.0 for the four restricted runs. Because similarities between very short sentences tend to be high when normalized term weights such as those of expression (1) are used, the sentence comparisons are based on non-normalized term weights [i.e. expression (1) is used without the normalizing factor in the denominator]. This explains why the minimum required sentence similarity (between 30 and 50) is much larger than 1. The non-normalized similarity computation gives preference to longer matching sentences, which are more indicative of coincidences in text meaning.

As Table 1 shows, the 11-point average precision (the average precision computed at recall levels ranging from 0 to 1 in steps of 0.10) improves from 0.1364 for the unrestricted global searches to 0.1834 when local sentence matches above a similarity threshold of 35.0 are required. This represents an advantage of 34.5%. The local sentence matching system is further characterized by the data in rows 1 and 2 of Table 1. The total number of retrieved items for the 74 queries decreases drastically as the required similarity in local contexts becomes more demanding, but the number of relevant retrieved items does not decrease at all for the better sentence match restrictions. Thus, for the best case, the total number of retrieved items decreases by 60% (from 14,800 to 6927), while the number of relevant retrieved items actually increased by 4% (from 516 to 535). This illustrates the effect of the local text comparisons as a precision filter capable of eliminating large numbers of extraneous items (Salton & Allan, 1994).

In the TREC environment, the most effective form of tf×idf term weighting uses the logarithm of the term frequency, rather than the raw term frequency as in expression (1), for both query and document terms. Furthermore, the inverse collection frequency factor [log N/n_k in expression (1)] is applied only to the query terms, but not the document terms. The resulting term weighting system, known as lnc-ltc, provides a higher performance standard for unrestricted global vector matches than that shown in column 1 of Table 1 (0.2340 average precision instead of 0.1364). The local search restrictions offer additional improvements of 5 or 6% over the higher baseline, corresponding to an average search precision of about 0.25.
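As a hedged illustration of the sentence-level restriction used for Table 1, the sketch below computes non-normalized sentence weights and accepts a sentence pair only when at least three terms match and the raw similarity reaches the chosen threshold; the exact tokenization and bookkeeping used in the experiments are not reproduced here.

```python
import math
from collections import Counter

def unnormalized_sentence_weights(sentence_terms, doc_freq, n_docs):
    """tf x idf weights without the length-normalizing denominator of expression (1)."""
    tf = Counter(sentence_terms)
    return {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items() if doc_freq.get(t)}

def sentence_pair_accepted(query_sent, doc_sent, threshold, min_matching_terms=3):
    """Require enough matching terms and a sufficiently high non-normalized similarity."""
    shared = set(query_sent) & set(doc_sent)
    if len(shared) < min_matching_terms:
        return False
    similarity = sum(query_sent[t] * doc_sent[t] for t in shared)
    return similarity >= threshold             # thresholds between 30.0 and 50.0 in Table 1
```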

2. TEXT SEGMENTS AND TEXT THEMES

All natural-language texts, no matter what their intended aim and function, exhibit a recognizable internal structure. For example, scientific articles often start with a summary, followed by introductory materials, a development of the main topics, and a conclusion or review of the material covered.


Fig. 1. Tightly linked paragraph structure (encyclopedia article 16585, Viscount Horatio Nelson).

Different text types may be structured differently, but in general, text authors will insure comprehensibility by following requirements of clarity, coherence, and comprehensiveness. It should therefore be possible to determine the structural characteristics of texts, or documents, and to recognize the principal constituent pieces that make up the complete texts. Such a structural decomposition might then be used as a basis for the design of superior methods for text retrieval, text summarization, and selective text traversal.

Two types of text decomposition are of special interest. The first is a chronological decomposition into contiguous text pieces that carry a particular function in the documents under consideration. Such contiguous text pieces, known as text segments, may typically cover introductory developments, elaborations and examples, and concluding or summarizing materials. A chronological text segmentation into functional text units could be useful for text summarization by making it possible to assemble the most important text pieces from each segment. In addition to the chronological text segmentation, it is also useful to identify semantically homogeneous text pieces, known as text themes. A theme may be defined as a set of not necessarily contiguous text excerpts covering a common subject matter. Because of the semantic homogeneity of text themes, a specific, narrow search request may be answered best by retrieving the text excerpts identifying particular themes, rather than retrieving full document texts or contiguous text segments.

The global-local text comparison methods introduced in the previous section can be used as a basis for the structural and semantic decomposition of individual texts or groups of related texts. In particular, individual text pieces, such as text paragraphs or text sentences, can be compared, and pieces of text with sufficiently high text similarities can be related, or linked, in a text relationship map (Salton & Allan, 1993). The structure of the relationship maps, and the decomposition into segments and themes, can then furnish important information about the type of text under consideration and the way in which various text types should be processed.

Consider first some examples of well-behaved, expository texts, that is, texts that tell a story in a straightforward way. Figure 1 is a paragraph relationship map for an article published in the Funk and Wagnalls encyclopedia entitled "Viscount Horatio Nelson" (encyclopedia article 16585) (Funk & Wagnalls, 1979). The individual text paragraphs are shown as nodes of a graph, and the lines (or links) between paragraphs indicate a pairwise paragraph similarity exceeding a stated threshold (0.20 for document 16585).
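Building such a relationship map can be sketched in a few lines of Python: every pair of paragraph vectors is compared, and a link is recorded whenever the similarity exceeds the chosen threshold (0.20 for the map of Fig. 1). The representation as a link list plus adjacency sets is an assumption made here for the later segment and theme sketches, not a detail taken from the paper.

```python
from itertools import combinations

def build_relationship_map(paragraph_vectors, sim, link_threshold=0.20):
    """Link every pair of paragraphs whose vector similarity exceeds the threshold."""
    links = []
    adjacency = {i: set() for i in range(len(paragraph_vectors))}
    for i, j in combinations(range(len(paragraph_vectors)), 2):
        similarity = sim(paragraph_vectors[i], paragraph_vectors[j])
        if similarity >= link_threshold:
            links.append((i, j, similarity))
            adjacency[i].add(j)
            adjacency[j].add(i)
    return links, adjacency
```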


The highly linked structure of Fig. 1 shows that the text of the sample article is quite homogeneous: most paragraphs cover related subject matter, indicated by the fact that many paragraphs are related to other paragraphs. This implies that the text should be easy to read, or alternatively, that the essence of the text content may be ascertained by considering only a small number of text excerpts. The text homogeneity is illustrated by verifying the content of the paragraphs linked to some sample paragraph, such as paragraph 3 of document 16585 (16585.p3), as shown in Table 2. As the text indicates, everything revolves around the activities of Horatio Nelson, which largely consisted of engagements with the French fleet.

Since a text segment is a contiguous text piece that fulfills a particular role in the given text, one expects a substantial amount of linking between text excerpts within a particular segment. On the other hand, the linking to adjacent text segments may be sparse because these adjacent segments have a different function, and hence possibly different vocabularies and structure. The text segmentation then consists of finding gaps in the linking pattern between adjacent pieces of text. The example of Fig. 1 shows that for homogeneous maps such as those for article 16585, it is difficult to decompose the text into separable pieces. Thus, the sample text appears to consist of a single segment encompassing all paragraphs. Analogously, there is only one theme, characterized by the phrase "Horatio Nelson".

A much different text structure is revealed by the sparse linking structure of Fig. 2, showing the paragraph connection pattern for document 19829, History of Rome. Here different, largely unrelated topics are treated in the various paragraphs, and some clearly separable text pieces are evident. For example, the set of paragraphs between 19829.p5 and 19829.p12 is linked internally but disconnected from the adjacent material. This text area may then be identified as a segment. The same is true of the areas between paragraphs 14 and 23, and 50 and 52. The subjects covered by the several segments are shown in Fig. 2 along the periphery of the paragraph relationship map. For a relatively disconnected map such as that of Fig. 2, the main themes may be expected to be identical with the topics covered by the segments. One exception in the example of Fig. 2 may be the triangle formed by the three mutually linked paragraphs 19829.p6-p14-p22. This triangle contains paragraphs from at least two different segments. The corresponding text treats various interactions between the Romans and the Etruscans that took place over a lengthy period in Roman history. If text themes are initially defined as groups of mutually related text excerpts (Salton et al., 1994b), the previously mentioned triangle represents a new theme, not obtainable by the text segmentation.

Consider as a third example the text relationship map of Fig. 3 for encyclopedia article 859, entitled "American Revolution". In this case, the linking pattern is not as dense as that in Fig. 1, nor as sparse as that in Fig. 2, and the segments cover various episodes of the revolutionary war. Two areas of concentration are in evidence in Fig. 3.
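One way to realize the segmentation and theme ideas just described is sketched below, under two simplifying assumptions: a segment boundary is placed wherever no link spans the gap between adjacent paragraphs, and a theme seed is any triangle of mutually linked paragraphs. The authors' exact decision rules are not spelled out here, so the sketch is illustrative only.

```python
from itertools import combinations

def segment_by_link_gaps(n_paragraphs, links):
    """Split the text wherever no link crosses the boundary between adjacent paragraphs."""
    gap_positions = [b for b in range(n_paragraphs - 1)
                     if not any(i <= b < j for i, j, _ in links)]
    segments, start = [], 0
    for boundary in gap_positions:
        segments.append((start, boundary))     # segment = inclusive range of paragraph indices
        start = boundary + 1
    segments.append((start, n_paragraphs - 1))
    return segments

def triangle_themes(adjacency):
    """Return all triples of mutually linked paragraphs; these seed the text themes."""
    nodes = sorted(adjacency)
    return [(a, b, c) for a, b, c in combinations(nodes, 3)
            if b in adjacency[a] and c in adjacency[a] and c in adjacency[b]]
```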

Table 2. Contents of paragraphs related to 16585.p3 (Horatio Nelson)

  Paragraph    Basic contents
  16585.p3     Horatio, Viscount Nelson (1758-1805), British naval commander
  16585.p4     Nelson was born in Burnham Thorpe, Norfolk, on September 29, 1758
  16585.p8     Battle of the Nile, August 1798; Nelson destroyed most of the French vessels
  16585.p9     In 1801, Nelson became Vice-Admiral. Although second in command, Nelson assumed leadership of the British fleet in the battle of Copenhagen
  16585.p11    In the battle of Trafalgar in 1805, Nelson overwhelmingly defeated the combined French and Spanish fleet. He was mortally wounded by a French sharpshooter as the battle ended
  16585.p12    Nelson is regarded as the most famous of all British naval leaders


Fig. 2. Sparse paragraph connection pattern (encyclopedia article 19829, History of Rome).

Fig. 3. Text relationship map (encyclopedia article 859, American Revolution).

The beginning of the text from paragraphs 3 to 19 covers the causes of the American revolution, as well as the British attempts to quiet the rebellion in Massachusetts.


The other dense area between paragraphs 29 and 48 deals with the military engagements taking place during 1777 and 1778. The material connecting these dense areas (paragraphs 21 and 27) covers the early military engagements in Lexington and Concord, and the final paragraphs (paragraphs 56 and 60) describe the peace negotiations and the treaty of Paris. In the map of Fig. 3 there are few links crossing segment boundaries. In these circumstances, the themes often represent sets of adjacent segments. There are two obvious themes, covering respectively the preliminaries of the revolution from the beginning of the text until about paragraph 21, and the actual war operations starting with paragraph 23 and ending at paragraph 60.

For the expository texts used as examples up to now, there is a recognizable story line with distinct episodes, and the text themes seem to correspond closely to individual text segments, or a theme will cover several adjacent text segments. The theme subdivision will then present a simplified picture of the document structure where only the major topic shifts are in evidence. The structural characteristics of more complex document types are considered in the next section.

3. COMPLEX TEXT TYPES

Many texts exist in the literature which do not follow a clear line of reasoning. This is the case notably for non-expository texts consisting of notes, messages, or listings of various kinds, where no obvious relationships may exist between adjacent pieces of text, and the text structure is not easily discovered by simply traversing the text in chronological order. In such circumstances the relationship between text segments and themes may be much more complicated than for normal expository text.

Consider as an example the text relationship map of Fig. 4 representing the connection pattern for a story originally appearing in the Wall Street Journal under the title "Cajun Convention: A Confederacy of Doozies" (Wall Street Journal article 20216). This article describes certain happenings taking place at a New Orleans convention, and in view of the sparseness of the interconnections, the activities discussed at different points in the text appear to be quite unrelated. In fact, the item is a chronological account of different convention activities. A few segments are easily recognizable: thus the segment between paragraphs 2 and 9 is an introduction, with other segments discussing activities taking place on Thursday afternoon of the convention week (paragraphs 10 to 13), Friday morning (paragraphs 18 and 19), Friday afternoon (paragraphs 23 and 24), Saturday morning (paragraphs 25-28), and Saturday afternoon (paragraphs 34-37). Since the segments are nearly disjoint, any unifying content must be picked up through the themes. Assuming once again that themes are identified by triangles, or higher-order mutually related structures, two recognizable themes are provided by the linked paragraph set p3, p16, and p26, as well as by p18, p19, and p42. The first theme contains text excerpts from three different segments, with the common topic "Bourbon Street", whereas the latter theme is entitled "George Bush's minions". Theme one reflects the fact that various happenings at the convention took place in Bourbon Street, while the second theme refers to certain political activities related to the convention. As the example shows, for poorly structured texts such as that of Fig. 4, the text segmentation and theme decomposition provide quite different types of information.

Fig. 4. Text relationship map (Cajun convention, Wall Street Journal article 20216).

Figure 5 shows the connection pattern for a notice issued by the U.S. Food and Drug Administration announcing the availability of a number of drugs designated as orphan drugs (investigational drugs that have not yet been approved for marketing). This article (Federal Register Document 3579) consists of a brief introduction, followed by a list of over 200 short paragraphs, each listing a particular drug or biological product together with an explanation of the function of the drug, and the name and address of the manufacturer. The map of Fig. 5 shows that there are few areas with substantial connections between adjacent paragraphs. Exceptions are the introductory paragraphs (p8-p11), and a few areas between paragraphs 45 and 49, 65 and 75, 102 and 104, 159 and 161, 185 and 188, and 216 and 218. By and large, the orphan drugs appear to be listed in arbitrary order with little overlap or relationship between them. In this case, the segmentation is uninteresting, since it provides a large number of very small segments whose function is not clearly defined. A few themes appear in Fig. 5, identified by cross-hatching. Three themes are immediately recognizable:

• paragraphs 8-10 and 60: orphan drugs and biological designations
• paragraphs 216-218 and 68 and 120: (trade name not established) treatment of AIDS
• paragraphs 159-161 and 82: treatment of pneumocystis carinii pneumonia (PCP is a side effect of AIDS)

Fig. 5. Non-expository text (Federal Register Document 3579), Food and Drug Administration: announcement of availability and list of designated orphan drugs and biologicals.

Fig. 6. Non-expository text (Federal Register Document 42132).

Since document 3579 represents a list of drugs, the relationship between the various paragraphs must be provided by identities in the drug names, names of drug manufacturers, or by similarities in drug applicability. Examples of the latter are provided by the themes highlighted in Fig. 5. The example of Fig. 5 shows that for documents with uncertain segmentation properties, consisting of largely unrelated text sections, the text themes provide a much more interesting content description than the text segmentation.

Consider as a final example the map of Fig. 6 representing a legal document announcing the availability of certain investigational drugs. More specifically, the text consists of a summary, followed by highlights of the program, detailed contents of the announcement, and finally a list of amendments to existing rules relating to investigational drugs. The minimum link similarity used for the map of Fig. 6 is a relatively high 0.36, and the high map density shows that a large amount of repetition exists in the text under consideration. In fact, the set of amendments to the rule starting at paragraph 101 is almost entirely duplicative because the amendments effectively repeat the earlier text with appropriate minor changes.

A clearer view of the text structure is obtained from Fig. 7, which is a repeat of the earlier figure computed with a minimal link similarity of 0.60. Once again, the segmentation process produces mostly single-paragraph segments unrelated to neighboring paragraphs. The themes provide a better picture of text content. For example, the cross-hatched theme that includes paragraphs 6, 9, 44, 45, 71, 105, 113, and 115 is entitled "drugs intended to treat life-threatening or severely debilitating illnesses". Information about this topic appears in the summary, the content section of the program, and again as part of the list of amendments.


Fig. 7. Map of Fig. 6 computed for link threshold 0.60 (Federal Register Document 42132).

Analogously, paragraphs 27, 61, and 127 form a theme dealing with a risk-benefit analysis designed to establish whether the benefits inherent in approving a drug outweigh the known and potential risks.

Fig. 8. Parallel text structure (encyclopedia articles 16412, Napoleon I, and 16416, Napoleonic Wars).

Fig. 9. Clustering of retrieved items (* relevant documents) in response to query Q10 (AIDS treatment). Non-relevant retrieved items: 1740, 2717, 19156, 23008, 23556 (notices of meetings of the National Commission on AIDS; extension of application due date, etc.); relevant retrieved items: 3579, 9698, 28413 (availability of lists of designated orphan drugs and biologicals).

A study of text structure is useful in many areas of application. In particular, it appears possible to distinguish straightforward, expository texts with a simple story line from more complicated text structures by examining the relationship between the chronological text segmentation and the semantic theme identification. In the former case, simple text retrieval and text traversal methods may be based on an identification of the more important text segments, and a chronological text traversal starting at the beginning of the text. For texts with more complicated structures, a text theme identification appears essential, and the text must be traversed selectively rather than exhaustively.

In an information retrieval environment, useful passage retrieval techniques can be implemented which lead to the retrieval of particular text segments or themes in answer to individual queries. More often than not, a properly chosen text passage will constitute a better answer to an available query than the corresponding full document (Salton & Allan, 1993). Furthermore, improved retrieval results are available by using computed relationships between sets of retrieved documents. Consider, as an example, two items retrieved in response to a query about Napoleon (encyclopedia documents 16412, entitled "Napoleon I", and 16416, covering the Napoleonic Wars). Figure 8 is a text relationship map between groups of three adjacent sentences (sgroups) in the two documents. The connection pattern of Fig. 8 reveals a parallel structure between documents 16412 and 16416 in the sense that text fragments occurring early in article 16412 are linked to fragments occurring early in 16416 (link 16412.s34-16416.s22); fragments in the text centers are similarly linked (16412.s43-16416.s46), as are excerpts located at the end of the texts (16412.s77-16416.s83). Both documents in fact trace the life and battles of Napoleon from the Austrian defeat at Marengo (early), to the French campaign in Italy under Massena (middle), and the eventual exile of Napoleon in Elba (late). In that case, the common text portions might be treated as especially important for text retrieval or text summarization purposes (Salton et al., 1994b).
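The parallel structure visible in Fig. 8 can be tested mechanically: link the sentence groups of the two documents as before, and check whether the linked positions advance together. The monotonicity test below is one plausible formalization, not the criterion actually used by the authors.

```python
def cross_document_links(sgroups_a, sgroups_b, sim, threshold):
    """Pairs of sentence-group indices (one from each document) with similarity above threshold."""
    return [(i, j) for i, vec_a in enumerate(sgroups_a)
                   for j, vec_b in enumerate(sgroups_b)
                   if sim(vec_a, vec_b) >= threshold]

def is_parallel(links):
    """Crude parallelism test: ordered by position in the first document, the linked
    positions in the second document never move backwards."""
    if len(links) < 2:
        return False
    ordered = sorted(links)
    return all(j1 <= j2 for (_, j1), (_, j2) in zip(ordered, ordered[1:]))
```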


Text relationship computations between full-text documents, rather than between text excerpts, may also prove helpful in retrieval. Figure 9 is a relationship map for eight full-text items retrieved in response to one of the queries used in the well-known TREC evaluation studies (query 10, dealing with AIDS treatments) (Harman, 1995). Three of the eight items (documents 3579, 9698, and 28413) were rated as relevant to the query, the remainder being non-relevant. It may be noted that the three relevant items are mutually connected, and so are the five non-relevant ones. When the relevant and non-relevant items retrieved in response to a given query form mutually disjoint clusters, as they do in the example of Fig. 9, query optimization techniques such as relevance feedback are usable, capable of retrieving new items similar to previously obtained materials designated as relevant. By the same token, items not similar to those designated as relevant can be rejected (Rocchio, 1971).

The examples used in this study make it clear that the text decomposition procedures now in place can produce detailed representations of text relationships both within particular documents and between different texts. The structural decompositions are useful in various applications, such as determining text themes, retrieving wanted text excerpts, and traversing texts selectively in accordance with individual user needs. Sophisticated automatic aids are currently being designed providing flexible access to large full-text collections with arbitrary subject matter and coverage.
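As a closing illustration, the relevance feedback operation cited above (Rocchio, 1971) can be sketched as a reweighting of the query vector toward the relevant retrieved items and away from the non-relevant ones; the mixing parameters alpha, beta, and gamma are conventional choices and are not taken from this paper.

```python
def rocchio_feedback(query_vec, relevant_vecs, nonrelevant_vecs,
                     alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant items and away from non-relevant ones."""
    def centroid(vectors):
        totals = {}
        for vec in vectors:
            for term, weight in vec.items():
                totals[term] = totals.get(term, 0.0) + weight
        return {t: w / len(vectors) for t, w in totals.items()} if vectors else {}

    rel, nonrel = centroid(relevant_vecs), centroid(nonrelevant_vecs)
    new_query = {}
    for term in set(query_vec) | set(rel) | set(nonrel):
        weight = (alpha * query_vec.get(term, 0.0)
                  + beta * rel.get(term, 0.0)
                  - gamma * nonrel.get(term, 0.0))
        if weight > 0:                         # negative weights are conventionally dropped
            new_query[term] = weight
    return new_query
```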

Acknowledgements: This study was supported in part by the National Science Foundation under grant IRI 9300124. An earlier version of this paper was presented at the RIAO '94 Conference at the Rockefeller University, New York, 11-13 October 1994.

REFERENCES

Funk and Wagnalls New Encyclopedia (1979). Funk and Wagnalls, New York. 29 volumes, 25,000 encyclopedia articles.
Harman, D. K. (Ed.) (1995). Proceedings of the Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-215. National Institute of Standards and Technology, Gaithersburg, MD.
Rocchio, J. J., Jr. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The Smart retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall.
Salton, G. (Ed.) (1971). The Smart retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall.
Salton, G. (1991). Developments in automatic text retrieval. Science, 253(5023), 974-980.
Salton, G., & Allan, J. (1993). Selective text utilization and text traversal. In Proceedings of Hypertext-93. New York: Association for Computing Machinery.
Salton, G., & Allan, J. (1994). Text retrieval using the vector processing model. Technical Report, Computer Science Department, Cornell University, Ithaca, NY.
Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523.
Salton, G., & Buckley, C. (1991a). Automatic text structuring and retrieval: Experiments in automatic encyclopedia searching. In Proceedings of the 14th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 21-30). New York: Association for Computing Machinery.
Salton, G., & Buckley, C. (1991b). Global text matching for information retrieval. Science, 253, 1012-1015.
Salton, G., Allan, J., & Buckley, C. (1994a). Automatic structuring and retrieval of large text files. Communications of the ACM, 37(2), 97-108.
Salton, G., Allan, J., Buckley, C., & Singhal, A. (1994b). Automatic analysis, theme generation, and summarization of machine-readable texts. Science, 264, 1421-1426.