Information Processing & Management, Vol. 30, No. 5, pp. 619-629, 1994
Copyright © 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved.
0306-4573/94 $6.00 + .00
0306-4573(93)E0006-A

LOOKING IN TEXT WINDOWS: THEIR SIZE AND COMPOSITION

STEPHANIE W. HAAS and ROBERT M. LOSEE, JR.
School of Information and Library Science, CB# 3360, 100 Manning Hall, University of North Carolina, Chapel Hill, NC 27599-3360, U.S.A.

(Received 22 June 1993; accepted in final form 18 October 1993)

Abstract--A text window is a group of words appearing in contiguous positions in text. Intuitively, words in such close proximity should have something to do with each other. We can use the text window to exploit a variety of lexical, syntactic, and semantic relationships without having to analyze the text explicitly for their structure. This research supports the previously suggested idea that natural groupings of words are best treated as a unit of size 7 to 11 words, that is, plus or minus three to five words. Our text retrieval experiments varying the size of windows, both with full text and with stopwords removed, support these size ranges. The characteristics of windows that best match terms in queries are examined in detail, revealing interesting differences between those for queries with good results and those for queries with poorer results. Queries with good results tend to contain more content word phrases and fewer terms with high frequency of use in the database. Information retrieval systems may benefit from expanding thesaurus-style relationships or incorporating statistical dependencies for terms within these windows.

1. INTRODUCTION

The idea of a "window" or span of words appearing in contiguous positions in text has been useful in studying the nature and degree of relationships between words. Intuitively, words that appear in close proximity should have more to do with each other, lexically, syntactically, and semantically, than words appearing at a distance. The strength of this relationship and the distance at which it occurs have been studied in several contexts. Accepting the notion of the window allows the exploitation of these relationships without having to analyze the text explicitly for details about their internal structure.

The focus of this research is on the size of the windows or spans and the effects of varying the window size on information retrieval performance. Windows are extracted from documents, and the number of words in a window in common with the query is used as a measure of similarity between the document and the query. The changes in performance as the window size is changed and as the number of matching words varies are used to study the characteristics of windows and window size. An analysis of the words appearing in the windows reveals some interesting differences between those produced by queries with good results and those with poorer results.

The notion of a window or span deserves some attention. In some research, a window is viewed as the words surrounding a target word, and is thus described as the target plus or minus some number of words. Other research uses the notion of a span of words without the idea of a center from which the window is constructed. The research described here falls into the latter category, but we believe the results are applicable to either case.

2. BACKGROUND

Evidence from text-based research indicates that there are a variety of linguistic or textual relationships that either appear within a window or limited span of words, or appear at their strongest within such a window. These syntactic or semantic relationships include lexical, phrasal, co-occurrence, and dependency relationships. If the limiting span of words is of roughly the same size for all of the relationships, it may be possible to exploit them by using the window itself as a source of information, rather than separately interpreting each of the relationships and finding some way to combine the results. In this section, we review the evidence for window size provided by previous research.

Research in a variety of linguistic settings indicates that most lexical relationships between words probably appear in a window size ranging from plus or minus three to plus or minus five words (Martin et al., 1983). This range was confirmed by Phillips (1985) for collocational information. In another collocation study, Church and Hanks (1990) used a similar window size; however, they examined only the five words occurring after the target word. They used a measure of association based on mutual information to find "interesting" word pairs. The "interesting" pairs showed a variety of lexical, syntactic, and semantic relationships. Maarek and Smadja (1989) used a 10-word window in building index phrases for retrieval of UNIX manual pages. They used Smadja's EXTRACT (Smadja, 1989) to identify all pairs of content words (open class words) in a window of plus or minus five word tokens from within the documentation. Phrases were chosen to index the documentation from these pairs, based on their mutual information values. Note that all of these studies looked only at word pairs that could be found in the window span, not at larger combinations.

Other work using window size as a parameter focussed on a window span of size seven (plus or minus three) as being especially useful, based on its performance in the following studies. Haas and He (1993) experimented with identifying members of a sublanguage (SL) vocabulary by looking in the immediate vicinity of words known to be in the SL. The goal in this research was to extract the SL vocabulary words while filtering out the general language words. The window was centered around a known SL word, and the other words in the window were considered as possible SL words. They used a window size of plus or minus three in their initial attempts, but found evidence that the best size would vary depending on the characteristics of the individual SL. This procedure gets much of its power from syntactic structures such as noun-noun modification. Those SLs that contained many contiguous SL words, as in the noun phrase computer communication network performance analysis, would yield good SL vocabulary extraction results with the same or slightly larger window sizes (plus or minus three or four). For SLs where the technical vocabulary occurred interspersed with general words, a smaller window size, perhaps plus or minus one or two, would give better results. If nothing was previously known about the SL, or if it did not have the long noun phrases mentioned above, plus or minus three could be useful as a default window size.

A window size of plus or minus three has also been shown effective in document retrieval experiments. The size of a window can be understood as placing a limit on the set of terms having strong statistical relationships among themselves.
Applications that require the estimation of a text string's probability of occurrence may assume statistical independence of terms, may incorporate all relevant probabilities and compute an exact probability, or may use only the dependence information assumed to be important, that is, the relationships between terms in windows that are presumed to matter. Binary term occurrences in a text string may be used in estimating probabilities assuming full dependence information, or may be used in computing limited amounts of dependence information by using the Bahadur Lazarsfeld Expansion (BLE), where dependencies are computed for groups of one to n terms, with n less than or equal to the length of the text string whose probability is being estimated. Incorporating large degrees of dependence when estimating a probability increases computation time; applications may be made faster by decreasing the degree of dependence, n, or by limiting the window size within which probabilities are estimated. Using the Bahadur Lazarsfeld Expansion, Losee (1994) found that a probabilistic information retrieval system's performance increased most rapidly for n of two or three as the window size was increased to plus or minus three terms. Incorporating dependence information from further increases in window size resulted in less rapid increases in retrieval performance.
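The paper does not write the expansion out; for reference, a common statement of the Bahadur Lazarsfeld expansion for binary term occurrences x_1, ..., x_n with marginal probabilities p_i = P(x_i = 1) is (our addition):

z_i = (x_i - p_i) / \sqrt{p_i (1 - p_i)}

P(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i} \Bigl[ 1 + \sum_{i<j} \rho_{ij} z_i z_j + \sum_{i<j<k} \rho_{ijk} z_i z_j z_k + \cdots \Bigr]

where \rho_{ij} = E[z_i z_j], \rho_{ijk} = E[z_i z_j z_k], and so on. Truncating the series after the terms involving groups of at most k variables retains dependence information only for groups of up to k terms; keeping only the leading 1 recovers the term-independence assumption.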

3. MOTIVATION FOR USING WINDOWS FOR RETRIEVAL APPLICATIONS

The convergence on a window size of plus or minus three shown by the two preceding studies is especially interesting, because the work started from different assumptions and had different goals. It is also interesting to note how similar this size is to that found in the previously described research. This suggests the existence of a window within whose size range the various relationships (lexical, dependence, co-occurrence, etc.) are at their strongest. From a theoretical perspective, this would indicate that the notion of a window, whether centered around a particular word or not, captures several different types of relationships and conflates them into a single useful object. Utilizing the window as an indicator of these relationships does not give the same detailed information as actually analyzing the text for the various relationships, but allows their presence to be exploited, with very practical results. It provides a principled limit to be placed on any type of analysis that uses distance as a component. As Losee (1994) found, for example, performance on the information retrieval task improved less rapidly when dependence information from beyond the window size of plus or minus three terms was incorporated.

There are several applications where the assumption that the strongest relationships occur within a specific window size might prove useful. One possible application of such a limit in information retrieval might be to provide a limit within which query terms must appear. Most commercial retrieval systems already allow the user to specify the number of words that may appear between terms; this would provide a linguistic basis for determining the limit, and would provide a default limit. It could also be used to determine a cut-off point for calculating dependence information (as in Losee, 1994), to design weighting schemes for query terms that appear near each other, or to specify a limit within which terms that appear near a target term should be expanded using synonyms or other thesaurus relations.

Other applications could be found in corpus or text analysis. Some of these, such as identifying common phrases, determining co-occurrence relationships, and finding sublanguage terms, have already been discussed above. Others could include calculating the strength of syntactic relationships (Sheridan & Smeaton, 1992), building dictionary definitions, or other types of knowledge acquisition tasks.

The research described in this paper explicitly studies the effect of changing window size on task performance in an attempt to determine the best estimated size for the task. The chosen task should be one in which the variety of linguistic relationships could be expected to have some effect, either separately or together. Information retrieval satisfies this criterion. Retrieval based on looking for words from the query occurring in the title or abstract of a document is not very good. One problem with this type of retrieval is that the words may occur anywhere, separated by sentences, or even paragraphs, thus losing any kind of relationship they had in the query. Retrieval can be improved by taking advantage of these relationships, which can be seen in the use of phrases as index terms rather than merely single words. In addition, information retrieval provides clear ways of measuring differences in performance.
Specifically, we examine (1) how performance on an information retrieval task varies as window size is changed, (2) whether there is a range of effective window sizes for this task, (3) how the size of the most effective windows relates to those used in previous research, and (4) what inside the windows gives this method its power; that is, what sort of relationships seem to be present. Although the method described here could certainly be used for information retrieval, we are using the information retrieval task mainly as a way of measuring the effect of changing the window size. Further examination of the windows does allow us to suggest ways in which retrieval could be improved, using the window size as a range in which the improvements would be applied.

4. EXPERIMENTAL DESIGN

Information retrieval systems may serve as testbeds on which to evaluate the impact of varying window sizes on the computerized manipulation and matching of natural language text strings. These test systems determine the degree of similarity between a query and a document represented by a text string, and rank the documents by their degree of similarity to the query. By limiting the portions of the document to be considered for matching with the query, and holding other factors constant, changes in performance as window sizes are varied can be interpreted as indicating changing degrees of "meaning" or "aboutness" being captured by the windows. This occurs regardless of the actual means conveying the "aboutness" in the window, which could include noun-noun or adjective-noun strings, isolated content words, dependency relationships, or other linguistic phenomena.

The database being used for testing the effects of different window sizes is based on the CF (cystic fibrosis) database (Shaw et al., 1991), which contains 1239 article titles, 100 queries, and judgments about the relevance of each document to each query. The 100 queries in the database are questions in natural language form proposed by members of an internationally recognized team of CF researchers at the University of North Carolina-Chapel Hill Medical School. Document titles and abstracts were obtained for all those documents that were indexed by the subject heading "Cystic Fibrosis" during the 1974-1979 period in the Medlars database. Members of the research team evaluated each title and abstract to determine its relevance to each of the natural language queries. An extract of this database, referred to as the FULL784/87 database, contains only the 784 documents having published abstracts attached and the 87 queries having the greatest number of relevant documents. The number of relevant documents per query ranges from 2 to 64. CONTENT784/87 contains the same sets of data as FULL784/87, except that all occurrences of 203 stopwords were removed from queries and document abstracts. CONTENT784/87 has about 60% of the number of word tokens contained in FULL784/87. These databases currently reside on a UNIX workstation, and most of the information retrieval and evaluative code consisted of Bourne shell scripts.

The similarity between a query and a document is determined for the purposes here as the greatest similarity between the query and any of the possible windows in the document. In turn, the similarity between a query and a single window is computed as the number of word types (i.e., specific words as opposed to actual occurrences of words) in common between the query and the window. The number of types in common between the query and the best window is referred to as the Coordination Level Match (CLM) (Losee, 1987). Counting the number of matching word types between a document window and a query is an easily understood method of computing similarity that can be shown to be a simplifying case of many more sophisticated similarity measuring techniques (e.g., probabilistic or vector retrieval methods). In summary, the part of the document with the greatest number of term types in common with the query will represent the document. The similarity between this document window and the query is the CLM, which provides a measure of the similarity between the two natural language phrases. For retrieval purposes, documents are arranged in decreasing order of their CLM for a given query. Thus, those with a window containing more matching words are ranked above those with fewer. The Average Search Length (ASL) for the query is then computed as the average number of documents retrieved when retrieving a randomly selected relevant document.
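To make the window matching concrete, the following sketch (ours, not the authors' Bourne shell implementation; the tokenization and names are illustrative) computes the CLM of a document for a query as the largest number of query word types found in any fixed-size window:

import re

def clm(query_text, doc_text, window_size):
    """Coordination Level Match: the largest number of query word types
    found in any contiguous window of window_size tokens in the document."""
    tokenize = lambda s: re.findall(r"[a-z0-9]+", s.lower())
    query_types = set(tokenize(query_text))
    doc_tokens = tokenize(doc_text)
    if len(doc_tokens) <= window_size:
        return len(query_types & set(doc_tokens))
    best = 0
    for start in range(len(doc_tokens) - window_size + 1):
        window = set(doc_tokens[start:start + window_size])
        best = max(best, len(query_types & window))
    return best

A window of "plus or minus three" in the paper's terms corresponds here to window_size = 7.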
Using the ASL provides a single number performance measure that directly relates to the time a user spends searching, and is analytically tractable at the same time. A small ASL indicates good performance, whereas larger ASLs indicate decreasing levels of performance. That is, in order to find any relevant document, more nonrelevant ones would be encountered first. Ideally, the relevant documents would be highly ranked, and thus found immediately, giving a small ASL. Results reported below are expressed in terms of an ASL for a given window size and database.
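In the same spirit, a simplified ASL (ours; it leaves aside how documents with tied CLM scores are ordered) is the mean rank of the relevant documents in the CLM-ordered list:

def average_search_length(scores, relevant_ids):
    """Rank documents by descending score and return the mean 1-based rank
    of the relevant documents; smaller values indicate better performance."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    positions = [rank for rank, doc_id in enumerate(ranked, start=1)
                 if doc_id in relevant_ids]
    return sum(positions) / len(positions) if positions else float("inf")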

5. RETRIEVAL PERFORMANCE

The documents in the FULL784/87 and CONTENT784/87 databases were ranked for retrieval based on the maximum degree of commonality between the query and any phrase in the document. Obviously, the more information a retrieval system has about a document, the better the performance is expected to be. These tests examine the change in performance as the window size changes. Performance appears to change at different rates,
depending on the window size used. These rate differences suggest that there is a best estimate window size of approximately plus or minus three to five tokens when the full text of document abstracts is used.

Retrieval results when using both full text (FULL784/87) and text with stopwords removed (CONTENT784/87) are shown for varying window sizes in Fig. 1. Although there is a continuing decrease in the average search length as the window size increases, representing performance improvement, the greatest improvement appears to be when the window sizes are small.

[Fig. 1. The average search length varies with the window size for the full text (FULL784/87) and content only (CONTENT784/87) databases.]

The marginal decrease in ASLs shown in Fig. 2 suggests that for full text (FULL784/87), the greatest increases in performance occur when increasing a window size smaller than plus or minus five tokens. At this limit, a dip in the marginal decrease in ASL appears to occur, probably because terms at the edge of a window contribute little additional information capable of improving the retrieval process. We call this the "inter-window dip." Increasing the window size beyond plus or minus five tokens results again in a greater marginal decrease in ASL. Due to the results from the CONTENT784/87 database, described below, we assume that this is followed by another dip.

[Fig. 2. The decrease in average search length as the window size increases reaches a minimum at about 10 terms (plus or minus 4 or 5 terms) for the full text (FULL784/87) database and at 7.5 terms (plus or minus 3 terms) for the content only (CONTENT784/87) database.]

We believe the inter-window dips are due to the natural window characteristics of the text. Recall that the windows discussed here are self-selecting, in that a document is retrieved based on the window or windows containing the greatest number of matching tokens. A text window can be thought of as a sequence of useful matching words, surrounded by some less useful words. These sequences may be grammatical constituents or phrases, although that is not necessarily the case. The ideal window just fits over the useful words, excluding the less useful words between windows. The two inter-window dips might
indicate that the window size is crossing the boundaries between useful word sequences, and that windows tend to be bounded by tokens that are less subject bearing than tokens toward the middle of a window.

An analysis of the sequential pattern of term discrimination values using Fourier analysis provides no support for the presence of long duration waves of the size suggested by this or other window research. Over half of the text of Alice in Wonderland was translated into a set of term discrimination values, with high values being assigned to mid-frequency terms and low values to very high and very low frequency terms. This numeric vector was then analyzed by the Fourier analysis program in Mathematica to look for simple waves in the data. None were found, but more sophisticated forms of analysis might be able to discern waves of window-like structures that were missed in this analysis. This preliminary attempt at looking for windows through Fourier analysis implies that useful windows are not evenly scattered throughout the text, but must be selected in some other way, such as by the number of matching words.

The CONTENT784/87 database also shows a dip in the marginal decrease in ASL. Because many types are omitted from the content-only database, the inter-window dip occurs at a much smaller window size than for the larger-windowed full text database. The presence of a dip in this database indicates that the words between the windows are not merely strings of stopwords, such as is in the. This data set was smoothed by averaging the ASL for pairs of adjacent window sizes (e.g., 2 and 3, 3 and 4, etc.).
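One reading of the smoothing just described (our interpretation; the paper does not spell out the exact computation) averages the ASL over adjacent window sizes and then takes the decrease between consecutive smoothed values:

def dampened_marginal_asl(asl_by_window):
    """Average ASL over pairs of adjacent window sizes, then return the
    decrease between consecutive smoothed values (the marginal decrease)."""
    sizes = sorted(asl_by_window)
    smoothed = [(asl_by_window[a] + asl_by_window[b]) / 2
                for a, b in zip(sizes, sizes[1:])]
    return [earlier - later for earlier, later in zip(smoothed, smoothed[1:])]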

6. LOOKING IN THE WINDOWS

In addition to the quantitative analysis described above, windows generated for individual queries were examined to see if there was a qualitative difference between those queries with better ASLs and those with worse ASLs. In particular, we wished to determine if windows associated with more successful queries contained more or stronger instances of any of the variety of linguistic phenomena discussed earlier. Four queries with low (good) ASLs and four with high (poor) ASLs were chosen for examination. These query sets will be called L-ASL and H-ASL, respectively. Example L-ASL and H-ASL queries were:

L-ASL: What is the role of viral infection in the lung disease of CF patients?
H-ASL: What abnormalities of prostaglandin metabolism have been described in CF patients?

Some attempt was made to choose queries with similar characteristics, although exact matching was not possible. Table 1 shows some characteristics of the sentences in L-ASL and H-ASL, such as length, number of distinct word types, etc. The last line of the table describes the use of query words such as what is the, which are frequently used to formulate a question. Much of this analysis uses a larger window size than that which produced optimal retrieval performance. This allowed us to look for larger structures surrounding the smaller window size, as well as structures in the smaller windows.

6.1 An apparent ceiling on the number of possible matches

The maximum number of word type matches for any window size is the minimum of the window size itself or the number of word types in the query. So, for a window size of 13, it would be possible to match 13 words from the query, if the query contained that many. In practice, however, the number of matches seemed to increase toward and then level off at about 7. For windows of size 13, two of the L-ASL queries matched 7 words, and two matched 5. Two of the H-ASL queries matched 7 words, one matched 6 words, and one matched 5 words. It is not clear that continuing to expand the window size would immediately result in more matches. For one query with a low ASL, seven windows of length 15 were identified containing 7 matching terms. In examining the abstracts from which these windows were drawn, it was seen that the window sizes could have been expanded up to 20 words before additional matches were obtained. This low ceiling seems surprising, given that matches could include the common stopwords, and other common words in the collection such as patient and cf.

The next step was then to try to determine other limits on the number of matching terms. One possible limit is the number of word types in the queries, shown on the second line of Table 1. For L-ASL queries, the minimum was 8; for H-ASL queries, the minimum was 9. However, both sets also contain queries with many more word types; at least for these queries, many more matches are possible. A further limitation would occur if any of the query terms did not appear in the collection at all. Two words used in the queries were not used in the collection, one from an L-ASL query and one from an H-ASL query. Seven other query words had a frequency of less than 5 in the collection. Three of these were from L-ASL queries, three from H-ASL queries, and one from both. This last was what, a query formulation word.

Since neither the number of words used in the queries nor the frequencies with which they were used in the collection seems to limit greatly the possible number of matches in windows, another explanation must be found. It seems reasonable to conclude that words used in close proximity in the queries (in the same sentence, or occasionally in two sentences) are not used together in the abstracts. Some of these words may not be used together in the same abstract at all; others may be in the same abstract, but at a distance from each other. However, we cannot conclude that any window that does contain a high number of matches indicates a relevant document, since there was no difference between the L-ASL and H-ASL queries in that regard.

Table 1. Characteristics of L-ASL and H-ASL queries

                        L-ASL                     H-ASL
                 Avg     Min     Max      Avg     Min     Max
Word tokens     14.5       8      25       13       9      18
Word types     12.25       8      18    12.75       9      17
Content words   8.75       6      14        7       5       9
Stop words      5.75       2      11        6       4       9
Query words        2       1       3        2       1       3


The difference may lie in the nature of the matching words themselves. Windows of size 13 from relevant documents for L-ASL queries were found to contain more matching words than those for H-ASL queries. For example, windows containing 6 matching words were found in relevant documents for L-ASL queries. For H-ASL queries, 4 was the highest number of matching words found in windows of the same size. Further, the L-ASL queries retrieved many more of these windows. Looking for windows of size 13 containing 4 or more matching words, the documents relevant to the L-ASL queries produced 193 of these windows. The documents relevant to the H-ASL queries produced only 32. So although there was no difference between the L-ASL queries and the H-ASL queries in terms of the number of matching words found in all the windows, there was a difference in the number of matching words found in the relevant documents. The words in the L-ASL queries occurred in proximity to each other in the documents relevant to the queries, and these clusters could be found using the windowing strategy. Words in the H-ASL queries did not cluster together in their relevant documents, and so the windowing strategy could not use them to rank these documents higher than nonrelevant documents. The lack of clustering might also indicate that words related to each other in describing a concept in the query were not used in the same way in the relevant abstracts.

6.2 Contiguity of matching words in windows

We then examined whether the matching words in both relevant and nonrelevant document windows occurred in isolation or in contiguous strings. If the matching words occurred contiguously in phrases, then that might provide some evidence that the windows were identifying concepts in the documents that had been expressed in the queries. If the matching words appeared in isolation, then that was less likely to be the case. As an example, consider a query containing the phrase the lung disease of CF patients. A window containing the three contiguous matching words the lung disease represents a portion of a document discussing the same concept as one expressed in the query. A window containing the three noncontiguous matching words the, of, and patients would receive the same CLM score, but is not as clearly related to the query. If the contiguity of matching words does affect retrieval performance, then there should be a difference in the numbers of contiguous matching phrases in the relevant and nonrelevant document windows. A further difference might be in the type of matching words; strings of content words are more informative than strings of stopwords. For instance, the string the lung disease contains two content words, and represents a specific concept. The string is in the contains no content words, and does not indicate any particular topic.

The longest contiguous strings of matching words were of length 3. Table 2 shows the numbers of two- and three-word matching strings found in windows of size 13 for L-ASL and H-ASL queries. There were no strings containing three contiguous content words, so we also looked at three-word matching strings that contained two content words. The total number of two- and three-word strings found in all documents did not differ much between the sets of queries, nor did the number of two- and three-content word strings.

Table 2. Two- and three-word matching strings in windows of size 13

                                 L-ASL                    H-ASL
                             All    Relevant          All    Relevant
2-word matches              5723         112         5925           0
2-word content              1617          74         1859           0
  omitting cf patients       608          40           13           0
3-word matches               816          17          589           0
3-word content                 0           0            0           0
3-word with 2 content        496          17          417           0
  omitting cf patients       270          17           21           0


However, when the extremely common phrase cf patient (163 occurrences in the database) was removed, the L-ASL windows had many more content strings remaining. The other striking difference is in the number of strings in windows from relevant documents. There were no two- or three-word matching strings found in windows from relevant documents for the H-ASL queries. The L-ASL queries did contain contiguous strings of words that also appeared in windows from relevant documents. The H-ASL queries contained almost as many contiguous strings of content words as the L-ASL queries. For example, the L-ASL queries contained eight two-word strings, and the H-ASL queries contained seven. However, the H-ASL strings did not appear in windows from the relevant documents.

It is possible that the difference in the number of matching content word strings is not due to the matching per se, but rather to the number of content word strings in the documents themselves. Windows from relevant and nonrelevant documents were examined to see if there was a difference in the number and length of content word strings in the documents. There were no marked differences between the L-ASL and H-ASL queries in this regard. Two- and three-word strings were common; four- and five-word strings were not uncommon. It is interesting to note that the relevant documents for both L-ASL and H-ASL queries did not contain any content word strings longer than 5. Windows from the nonrelevant documents had strings of as many as eight content words in length. (These generally crossed both constituent and sentence boundaries.) It therefore seems unlikely that the difference in content word string matching was due to the "availability" of strings in the documents.

It is also possible that the difference in the number of matching content word strings was at least in part due to the contexts in the documents in which the content words from the queries occurred. We examined the class of words (i.e., content or stopwords) with which the content words were paired (as either the first or second member) in the documents. There was little real difference in the number of all word pairs and content-content word pairs between the L-ASL and H-ASL queries. However, as before, removing the common content-content phrase cf patient did reveal a difference. The L-ASL query content terms appeared in a total of 747 content-content pairs in the collection. Ninety-five of these were pairs other than cf patient. The H-ASL query content terms appeared in a total of 494 content-content pairs, only five of which were not cf patient. It seems clear from these findings that this phrase, which was common in both the abstracts and the queries, was increasing many windows' CLM scores without differentiating between relevant and nonrelevant documents. The difference between the L-ASL queries and the H-ASL queries could be that the L-ASL queries contained additional terms that were able to contribute to the CLM score and provide information about the documents' relevance. The H-ASL queries did not, at least not to the same extent. Their CLM scores were determined in part by less useful matches, such as cf patient and stopwords.

Is it the case, then, that the stopwords just provide spurious CLM points without actually contributing to the quality of the retrieval? That is, should the similarity between query and document be determined without including stopwords? As the results presented in section 5 show, there is some evidence that this might be the case.
The CONTENT784/87 database yielded lower ASLs than FULL784/87 at each window size, indicating that omitting the stopwords from the CLM gave better performance. So it is possible that, at least for this task, windows can omit stopwords without losing the effects of whatever relationships are at work, thereby diminishing the effects of the unhelpful matches they form.

The examination of the contiguity of matching query terms in windows of relevant documents leads to some interesting thoughts. The L-ASL queries contained strings of contiguous content words, other than the very common cf patient, that also appeared in the relevant abstracts. Although the H-ASL queries contained about as many contiguous matching content words, those words appeared too frequently in the abstracts to be able to distinguish relevant from nonrelevant documents. We can then hypothesize that if the H-ASL queries could be rephrased to include more uncommon content words in contiguous strings, their retrieval performance would improve, because these strings would match terms in windows in relevant documents. This would be a logical next experiment to perform.
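A small sketch of the contiguity analysis used in this section (ours; the stopword list and whitespace tokenization are assumptions) finds maximal runs of contiguous window tokens that all match query word types and counts the content words in each run:

def matching_strings(query_text, window_tokens, stopwords=frozenset()):
    """Return maximal runs of two or more contiguous window tokens that all
    appear in the query, with each run's length and content-word count."""
    query_types = set(query_text.lower().split())
    runs, current = [], []
    for token in [t.lower() for t in window_tokens] + [None]:  # None flushes the final run
        if token is not None and token in query_types:
            current.append(token)
        else:
            if len(current) >= 2:
                content = sum(1 for w in current if w not in stopwords)
                runs.append((current, len(current), content))
            current = []
    return runs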


6.3 Linguistic relationships in the windows

It is not clear from the analyses presented above precisely what type of relationship is working at any one time. Sometimes co-occurrence seems to be playing a role; at other times, case or other kinds of syntactic relationships that bring query words into close proximity with each other in the documents seem to be more important. The strength of the windows is that all of these can be exploited at once, without actually having to do the required analysis.

We next examined the word pairs in which content words from the queries appeared in the abstracts. We built a concordance of word pairs for the content words, with the frequency for each word pair. We were interested in the range of linguistic relationships seen in the pairs, not just their frequency or strength of association.
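A minimal sketch of such a word-pair concordance (ours; simple whitespace tokenization is assumed) counts the adjacent word pairs in which a query content word appears as either member:

from collections import Counter

def pair_concordance(abstracts, content_words):
    """Count adjacent word pairs across the abstracts in which a given
    content word appears as either the first or the second member."""
    pairs = Counter()
    targets = {w.lower() for w in content_words}
    for text in abstracts:
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            if first in targets or second in targets:
                pairs[(first, second)] += 1
    return pairs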

Several interesting patterns emerged from this analysis. Some terms appeared with many different content words, forming pairs that occurred only once or twice in the collection. These terms seemed to have little identity of their own, forming concepts only in combination with other words. An example of this is the word abnormalities. In a medical database, its appearance by itself is not very informative, but it can be paired with many other words to form a specific concept, such as cardiovascular abnormalities.

Some terms appeared almost exclusively with one word. In the case of a content-content pair, this generally forms a phrase that describes a single concept, such as pseudomonas aeruginosa. Content-stopword or stopword-content pairs frequently indicated the case role or function of the content word (and often of the word following), as in composition of or is described.

Another interesting pattern occurred when the query term occurred with a small set of content words that could be considered synonyms, near-synonyms, or more specific or general terms of each other. For example, the term normal occurred with the word patient, but also with the words neonate, adult, children, and females, which are more specific words describing particular classes of patients.

Sometimes the word with which a term appeared in the query was not the word with which it occurred most often in the abstracts. For example, one query asked about vitamin d, which was used a few times in the abstracts. However, d was used more often in the phrase d pneumoniae. The query term was closely associated with two concepts in the collection that were unrelated to each other. In other instances when a query term occurred with more than one word in the abstracts, there was a clear relationship between the phrases. For example, the word lung was frequently used in the phrases lung disease and lung function. A disease of a body part generally is associated with some problem with the function of that body part.

At the end of the preceding section, we hypothesized that one way to improve retrieval would be to rephrase queries to include more strings of contiguous, uncommon content words. The linguistic patterns described above suggest some other ways of improving the performance of this window-based method of retrieval. In essence, we want to strengthen the relationships found in the windows without introducing more unrelated terms or concepts. We can think of the windows as indicating some words or phrases that are important because of their relationship(s) with the query words, which can then be manipulated to improve the retrieval results. The window size limits the amount of text to which these computationally more expensive processes must be applied. This is similar to other familiar ways of expanding the query, but in this case, the abstract could also be "expanded." Another way of phrasing this is "how can we rewrite the abstract so it looks more like queries to which it will be relevant?"

We propose another experiment to test whether using a domain-specific thesaurus in conjunction with window-based retrieval would improve retrieval performance. The thesaurus would capture the synonym, near-synonym, and hierarchical relationships between terms, allowing both the query terms and terms in abstracts to be expanded in ways consistent with the subject matter. It could also incorporate other types of relationships found in the abstracts, such as that between a body part, its function, and its diseases or "malfunctions." Member-class, test-result, and symptom-disease relationships would also be useful. The crucial point, however, is that the thesaural relationships would not be used for all the words in the document, but only for those terms that appear in the windows surrounding the query terms.

7. CONCLUSIONS

The words in a window of text can stand in many different relationships to each other. By utilizing a window, or span of text, information provided by these relationships can be exploited without having to analyze them in detail. This study has focussed directly on window size, using retrieval performance, namely, average search length, as a measure of the effect of varying the window size. The results give a more comprehensive picture of window size than that provided by previous research, and provide support for the effectiveness of a window size of plus or minus three to five. Omitting stopwords from the windows increased their power, as shown by an overall improvement in the ASL.

A comparative analysis of windows extracted for a group of successful queries and a group of less successful queries revealed some interesting differences in the composition of the windows. Windows for successful queries seem to identify concentrations of useful matching words, while those for less successful queries match stopwords and very common phrases that do not discriminate between relevant and nonrelevant documents. We proposed an experiment to determine the effect on retrieval performance of rephrasing less successful queries to include more strings of useful matching words. It would also be interesting to investigate the text of the abstracts further, to see if abstracts not retrieved by this method lacked any concentrations that could be found by windowing. If this were the case, it might lead to some guidelines for writing abstracts that are more easily retrieved by full text retrieval methods.

We also proposed an experiment to test our hypothesis that windows can be used to provide a principled span or limit within which to apply more intensive processing, such as expanding thesaural relationships in the text itself. For example, if the query contains the phrase normal patient, the text in document windows could be counted as a match or near match if it contained phrases made of normal and any of the near-synonyms or more specific terms adult, female, neonate, etc. The window provides a context in which such expansion could be useful, and a limit on the amount of text that must be expanded.
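A sketch of the window-limited expansion proposed here (ours; the flat synonym table is a simplifying stand-in for the richer thesaurus relationships described above) counts a window token as a match when a thesaurus maps it onto a query term:

def expanded_clm(query_text, window_tokens, thesaurus):
    """Count query word types matched in a window, also counting a window
    token when the thesaurus maps it to a query term (e.g., 'neonate' -> 'patient')."""
    query_types = set(query_text.lower().split())
    matched = set()
    for token in (t.lower() for t in window_tokens):
        if token in query_types:
            matched.add(token)
        elif thesaurus.get(token) in query_types:
            matched.add(thesaurus[token])
    return len(matched)

With thesaurus = {'neonate': 'patient'}, the window "the neonate was normal" would match both terms of the query phrase normal patient.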

REFERENCES

Church, K., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29.
Haas, S., & He, S. (1993). Toward the automatic identification of sublanguage vocabulary. Information Processing & Management, 29(6), 721-732.
Losee, R. (1987). Probabilistic retrieval and coordination level matching. Journal of the American Society for Information Science, 38(4), 239-244.
Losee, R. (1994). Term dependence: Truncating the Bahadur Lazarsfeld expansion. Information Processing & Management, 30(2), 293-303.
Maarek, Y., & Smadja, F. (1989). Full text indexing based on lexical relations. An application: Software libraries. Proceedings of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 198-206.
Martin, W., Al, B., & van Sterkenburg, P. (1983). On the processing of a text corpus. In R. Hartmann (Ed.), Lexicography: Principles and practice (pp. 77-87). London: Academic Press.
Phillips, M. (1985). Aspects of text structure. Amsterdam: Elsevier Science Publishers.
Shaw, Jr., W., Wood, W., Wood, J., & Tibbo, H. (1991). The cystic fibrosis database: Content and research opportunities. Library and Information Science Research, 13, 347-366.
Sheridan, P., & Smeaton, A. (1992). The application of morpho-syntactic language processing to effective phrase matching. Information Processing & Management, 28(3), 349-369.
Smadja, F. (1989). Lexical co-occurrence: The missing link. Literary and Linguistic Computing, 4(3), 163-169.