Pergamon
Information Processing & Management, Vol. 30, No. 4, pp. 515-533, 1994
Copyright © 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved.
0306-4573/94 $6.00 + .00
A LEARNING SCHEME FOR INFORMATION RETRIEVAL IN HYPERTEXT

JACQUES SAVOY
Université de Montréal, Département d'informatique et de recherche opérationnelle,
P.O. Box 6128, Station A, Montréal, Québec H3C 3J7, Canada

(Received 12 November 1992; accepted in final form 13 September 1993)
Abstract-In proposing a searching strategy well suited to the hypertext environment, we have considered four criteria: (1) the retrieval scheme should be integrated into a large hypertext environment; (2) the retrieval process should be operable with an unrestricted text collection; (3) the processing time should be reasonable; and (4) the system should be capable of learning in order to improve its retrieval effectiveness. To satisfy these four criteria, we have designed and implemented a search strategy for hypertext systems based on an extended Boolean model (the p-norm scheme) and supplemented it with links to improve the ranking of the retrieved items in a sequence most likely to fulfill the intent of the user. These links, representing additional information about document content, are established according to the requests and relevance judgments. Using a fully automatic procedure, our retrieval scheme can be applied to most existing systems. Based on the CACM test collection, which includes 3,204 documents, and the CISI corpus (1,460 documents), we have built a hypertext and evaluated our proposed strategy. The retrieval effectiveness of our solution presents encouraging results.
Keywords: Hypertext, Information retrieval, Learning scheme, Hypertext link semantics, p-norm model, Probabilistic retrieval model.
1. INTRODUCTION
In a toy-sized hypertext, searching for information does not present a real problem: both global and local access methods to information seem to be satisfactory. Firstly, hypertext systems promote global views of an information network by using various tables of contents, indexes, or global maps to aid user orientation or to indicate the location of interesting nodes. Secondly, local maps or the semantics associated with links can guide readers locally to the appropriate nodes when navigating through the hypertext. According to an empirical study by Alschuler (1989), when the number of nodes and links increases, these tools are no longer satisfactory. For example, indexes are divided into different levels, only one level is presented at a time, and they are sorted according to nonstandard conventions (partial alphabetical order). Sometimes an index entry is written using a synonym rather than the word the reader is thinking about. Tables of contents reflect the structure of the documents, and therefore are not adequate when searching for specific information (i.e., the text of the table entry may be too short or the meaning too broad, a particular document title may have more than one meaning, etc.). Thus, effective query-based access is required to search information stored in a large hypermedia network (Halasz, 1988). Solutions proposed to date are based on the Boolean model, a hybrid Boolean strategy, or the vector-processing scheme. The study of the retrieval effectiveness of these approaches reveals that they cannot find all the relevant documents in response to a user's request (Savoy, 1993a). Moreover, these models are not able to learn in order to increase their performance over time. This paper suggests a learning scheme as a new approach to information retrieval in hypertext as a means of improving retrieval performance.

Correspondence should be addressed to Université de Neuchâtel, Faculté de droit et des sciences économiques, Pierre-à-Mazel 7, CH-2000 Neuchâtel, Switzerland.
Section 2 presents our basic retrieval scheme, within which the system must operate without relevance feedback information. Section 3 describes our learning scheme, and Section 4 reviews basic probabilistic retrieval models and compares evaluation results obtained by our proposed approach with those generated by basic probabilistic retrieval techniques.

2. SEARCHING INFORMATION IN HYPERTEXT
A small hypertext does not reveal the problems generated by large text collections. This is analogous to the fact that programming on a large scale is different from programming on a small scale:

System complexity grows as the square of the number of system elements; therefore, experience with a small system cannot account for all the things that will have to be done in a large system. (Aron, 1983, p. 577)
Thus, to design a search strategy, we were faced with similar difficulties. Working with a large unrestricted text collection, we cannot control the input language, implying that the retrieval strategy must be based on robust methods instead of specialized mechanisms applicable only in narrow contexts and requiring a large amount of manual work (i.e., building a dedicated knowledge base or semantic network; see, for example, Nie, 1989). Our prior feeling is that a word like "apple" or "Macintosh" may have a clear meaning in a corpus dealing with botany. In an unrestricted document collection, the same words may have very different meanings (Furnas et al., 1987); for example, "apple" can be interpreted as a fruit or as a computer manufacturer; in a particular context, "big apple" means New York; "Macintosh" may signify a kind of apple, a microcomputer, or a person's name, etc. This qualitative analysis of lexical ambiguity may be complemented by quantitative aspects (Krovetz & Croft, 1992) showing that even in a specialized collection there is considerable semantic ambiguity. Thus, to retrieve content-based information within a large corpus, the system must identify the content of each node and build its surrogate (Section 2.1). As a primary search process, we propose using the p-norm model (Salton et al., 1983), which can operate with large textual collections and assures rapid processing time through simple approximations (Smith, 1990, Chapter 5) (Section 2.2).

2.1 Indexing procedure
In our experiment, to outline the semantics of each document, or node D_i, i = 1, 2, ..., n, we automatically index them according to Salton's principles (1989, Chapter 9). We:

1. find individual words (sequences of letters or digits);
2. use a stop list to remove common words (i.e., the, is, ...) (In our experiment, this list is formed by the union of the stop list proposed by Fox, 1990, and the one described in van Rijsbergen, 1979. The resulting list contains 488 terms.); and
3. use a suffix-stripping algorithm to produce stems or concepts.
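As an illustration only (not the authors' code), the three indexing steps can be sketched in Python; the stop list and the suffix rules below are toy stand-ins for the 488-term list and for a real stemming algorithm such as Porter's:

```python
import re

# Hypothetical miniature stop list; the paper's actual list has 488 terms.
STOP_WORDS = {"the", "is", "a", "of", "in", "and", "to", "for"}

def naive_stem(word):
    # Crude suffix stripping, standing in for a full stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_node(text):
    # Step 1: individual words (sequences of letters or digits).
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Step 2: remove common words with the stop list.
    words = [w for w in words if w not in STOP_WORDS]
    # Step 3: suffix stripping to produce stems.
    return [naive_stem(w) for w in words]

print(index_node("The indexing of hypertext documents"))
```

The resulting stems are the concepts T_k that receive weights in the formula below.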
Finally, to represent the weight w_ik of each concept or single term T_k, k = 1, 2, ..., t, in a node D_i, we use the following well known formula:

    w_ik = tf_ik * idf_k,   where   idf_k = log[n / df_k]    (1)

where tf_ik is the frequency of the keyword T_k in the document D_i, n the number of nodes D_i in the hypertext, df_k the number of documents in which T_k occurs, and idf_k the inverse document frequency. An example is given in Table 1. This indexing scheme considers the importance of a term using both its frequency in the document (tf component) and its specificity (idf component). This specificity does not depend on a semantic property of the given keyword, but is derived from a statistical notion, or
Table 1. Example of weighted indexing terms

Term    D1      D2      D3      D4
T1      0.10    0       0.20    0.45
T2      0.20    0.15    0       0
T3      0.50    0.10    0       0
T4      0.05    0.60    0.30    0.45
as Sparck Jones says, "we think of specificity as a function of term use" (Sparck Jones, 1972, p. 13). For example, the word "computer" may be viewed as very specific in a legal hypertext because this word appears rarely, whereas in a computer science corpus, it is very broad. Thus, with this weighting scheme, large documents are indexed by a large number of terms, and thus have a better chance of being retrieved than smaller texts. The normalization process presents a solution to this problem. To normalize both components to the range [0,1], we divide tf by the maximum tf value for any keyword in the document (Salton et al., 1983, p. 1029, eqn 13), and following Turtle (1990, p. 120), we divide idf by the logarithm of the collection length:

    ntf_ik = tf_ik / max_j tf_ij,   nidf_k = idf_k / log[n]    (2)
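A minimal sketch of this normalized weighting; the function name and the toy counts below are ours, not the paper's:

```python
from math import log

def normalized_weight(tf, max_tf, df, n):
    """w_ik = ntf_ik * nidf_k, with both components scaled into [0, 1]."""
    ntf = tf / max_tf            # tf divided by the max tf in the document
    nidf = log(n / df) / log(n)  # idf divided by the log of the collection size
    return ntf * nidf

# Toy example: a term appearing twice in a document whose most frequent term
# appears 4 times, in a 1,000-document collection where the term occurs in 10 docs.
w = normalized_weight(tf=2, max_tf=4, df=10, n=1000)
print(round(w, 3))
```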
Our indexing process represents documents and queries with words, or more precisely, with stems. Such an approach introduces two types of problems: (1) lexical ambiguity leads to the retrieval of nonrelevant documents that share at least one common keyword with a request, but use this term with a different meaning or in a different context; and (2) relevant documents cannot be retrieved because their representatives do not contain exactly the same terms as those used in the request. In our experiment, we have built a hypertext with the CACM collection (3,204 titles and some abstracts of articles from the journal "Communications of the ACM"). This corpus includes 50 Boolean queries and 5,823 unique index terms. The CISI collection contains 1,460 documents extracted from the information science literature, and includes 35 Boolean queries (76 queries in natural language) and 5,935 unique index keywords.

2.2 The p-norm model
Using the traditional Boolean model, the user may formulate a structured Boolean query in order to express an information need. During the matching process, a request represents the logical conditions a document must obey in order to be retrieved. Thus, a document is returned only if it satisfies the Boolean query exactly. For example, based on Table 1, the query (T1 AND T4) will retrieve Documents 1, 3, and 4 with equal strength, or with the same retrieval status value, which is equal to 1. In the p-norm model (Salton et al., 1983), the ranking factor depends not only on the number of keywords the documents have in common with the request, but also on the weights assigned to index terms and on the p-value attached to each query operator. Thus, for each node D_i, the retrieval status value is computed recursively according to Table 2, in which w_iA represents the weight attached to index term A in document D_i, and (p) the p-value assigned to each operator.
Of course, the ranking scheme lists retrieved documents in decreasing order of retrieval status values, and ties (if any) are broken with the publication date. For example, using the index keywords shown in Table 1, the query (T1 AND(2) T4) will retrieve Documents 1, 2, 3, and 4 and rank them as (D4, 0.45), (D3, 0.248), (D2, 0.238), and (D1, 0.075).
Table 2. Retrieval status value in the p-norm model

Binary query     Retrieval status value
A OR(p) B        [ (w_iA^p + w_iB^p) / 2 ]^(1/p)
A AND(p) B       1 - [ ((1 - w_iA)^p + (1 - w_iB)^p) / 2 ]^(1/p)
NOT A            1 - w_iA
The p-norm model allows the use of soft Boolean operators, and generalizes both the Boolean model and the vector-processing scheme. For example, the AND operator no longer imposes the presence of both index terms in a document in order for it to be retrieved (see Document 2 in the previous example). The generalization obtained by the p-norm model depends on the value attached to the parameter (p). On the one hand, setting this parameter to infinity, the Boolean operators behave as in standard Boolean logic. On the other hand, when (p) has the value 1, the distinction between the operators AND and OR vanishes completely. Moreover, this model allows users to introduce query term weights in order to reflect the importance of each topic included in their requests (Salton et al., 1983). The processing time of the basic p-norm model is slower than that of the traditional Boolean model which, using the well known inverted file organization, allows rapid processing. According to some approximations proposed by Smith (1990, Chapter 5), these two models may have comparable speed without altering the retrieval effectiveness of the p-norm scheme. However, this approach does not incorporate a learning scheme in order to improve retrieval effectiveness.
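The Table 2 formulas and the ranking example of Section 2.2 can be sketched as follows (a straightforward generalization to m operands is assumed; with two operands it reduces to the Table 2 expressions):

```python
def p_or(weights, p):
    # A OR(p) B: ((w_a^p + w_b^p) / m)^(1/p)
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def p_and(weights, p):
    # A AND(p) B: 1 - (((1 - w_a)^p + (1 - w_b)^p) / m)^(1/p)
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

# Index weights of terms T1 and T4 from Table 1.
docs = {"D1": (0.10, 0.05), "D2": (0.0, 0.60), "D3": (0.20, 0.30),
        "D4": (0.45, 0.45)}

# Query (T1 AND(2) T4): rank all four documents by decreasing RSV.
ranking = sorted(((d, round(p_and(w, 2), 3)) for d, w in docs.items()),
                 key=lambda x: -x[1])
print(ranking)
```

Running this reproduces the ranking given in the text: (D4, 0.45), (D3, 0.248), (D2, 0.238), (D1, 0.075); note that Document 2 is now retrieved even though it lacks term T1.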
3. A LEARNING SCHEME
The aim of a learning scheme adapted for information retrieval is to have the system record its successes and failures in order to increase its performance. To define such a scheme, we have to specify the underlying hypotheses, and determine how the system learns and how it stores and uses the knowledge provided by previous experiments. Finally, the learning scheme must specify the way the system operates in the absence of prior information. Our learning scheme is based on the following hypotheses:

1. Nodes known to be relevant to the same query tend to contain similar concepts and must deal with similar subjects.
2. No conclusions can be drawn about documents found nonrelevant for a given request.

On the one hand, our learning scheme is based exclusively on successes, or on the joint presence of retrieved and relevant documents. On the other hand, our procedure does not take into account the shared presence of retrieved and nonrelevant nodes. Nonrelevant items retrieved by the system are those documents that have at least one common keyword with the request. However, such keyword matching does not always imply word sense matching: "Word sense mismatches are far more likely to appear in nonrelevant documents than in those that are relevant" (Krovetz & Croft, 1992, p. 139). In order to represent the information given by previous experiments or requests, we have designed a special link type called a relevance link. This link type connects two nodes found relevant for a given query. With each link, a relevance value specifies how many times both linked nodes are found relevant.
Table 3. Examples of relevance information

Query    Relevant nodes
1        3, 7
2        3, 7, 11
3        2, 7
4        5, 10
5        10, 11
Based on the relevance information given by Query 1 (see Table 3), the system establishes a relevance link between Nodes 3 and 7. The relevance value of this link is set to one. When information about Query 2 is known, two new relevance links are created, one between Nodes 3 and 11, and one connecting Nodes 7 and 11. Since a relevance link is already established between Nodes 3 and 7, its relevance value increases by one. Figure 1 shows the relevance links with their relevance values when the information given by the five requests of Table 3 is taken into account. From Table 3, one can see that a document does not always appear jointly with the same nodes. For example, the relevance list for Query 1 includes Nodes 3 and 7; however, Documents 2 and 7 form the expected answer to Query 3. Such phenomena reflect the fact that the relevance information of a query is not based on the same unit of information (or proposition), because a large paper is not normally concentrated on one narrow subject; rather, it must be considered as a many-faceted entity. To include the information provided by the learning stage, our retrieval scheme works in two phases. In the first, the retrieval status value of each node is computed according to the p-norm model (see Section 2.2). In the second stage, the ranking of retrieved documents is modified according to the presence of relevance links, following eqn 3.
    RSV(D_i) = RSV_init(D_i) + Σ_k α_ik * RSV_init(D_k),   for i = 1, 2, ..., n,    (3)
in which α_ik reflects the strength of the link between nodes i and k. At the initial stage, the retrieval status value of a document depends only on the similarity between its content and the query (RSV_init(D_i), computed according to Table 2). The value α_ik can be either a constant or a function of the relevance value of the link connecting nodes i and k. However, our evaluation results shown in Section 4.5 indicate there is no significant difference between these two strategies.
Fig. 1. Relevance links based on Table 3.
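The construction of the relevance links of Fig. 1 from the judgments of Table 3 can be sketched as follows (function names are ours):

```python
from itertools import combinations
from collections import defaultdict

def build_relevance_links(relevance_judgments):
    """Create or reinforce a relevance link for each pair of nodes
    found relevant to the same query."""
    links = defaultdict(int)
    for relevant_nodes in relevance_judgments:
        for a, b in combinations(sorted(relevant_nodes), 2):
            links[(a, b)] += 1  # relevance value counts joint relevance
    return dict(links)

# Relevance information of the five queries in Table 3.
queries = [{3, 7}, {3, 7, 11}, {2, 7}, {5, 10}, {10, 11}]
links = build_relevance_links(queries)
print(links)
```

The link between Nodes 3 and 7 ends up with relevance value 2, and six links are created in total, matching the narrative above.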
For example, in Fig. 2, the p-norm model attributes a retrieval status value of 0.8 to Document 11. According to Formula 3, this weight is propagated through links to Documents 3, 7, and 10. If we define the strength of the link between Nodes 11 and 7 as 0.3, Document 7 will increase its retrieval status value by 0.24. In order to improve the efficiency of our retrieval scheme and to obtain a reasonable processing time, we do not modify the retrieval status value of all retrieved nodes; rather, we select the ten best-ranked documents after the initial stage to activate the relevance links. We believe that relevance links indicate semantic relationships between documents, and may be valuable in the searching process. Although Blair (1990) considers such a scheme to be a useful pedagogical tool, he questions its retrieval effectiveness:

Bush (1945) recognized early . . . how inquirers could benefit from the "traces" left by searches conducted by informed inquirers. While this is an important notion, realistically each inquirer's searches are unique enough that a record of previous searches might only be marginally useful for finding specific information. (Blair 1990, p. 181)
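A minimal sketch of this two-phase re-ranking (eqn 3), assuming α_ik is simply the stored link strength and that only the best-ranked nodes activate their links:

```python
def rerank(initial_rsv, links, top=10):
    """Second stage of eqn 3: the initial RSV of the best-ranked nodes is
    propagated along their relevance links (alpha = link strength)."""
    best = sorted(initial_rsv, key=initial_rsv.get, reverse=True)[:top]
    final = dict(initial_rsv)
    for k in best:
        for (a, b), alpha in links.items():
            if k in (a, b):                 # the link touches node k
                i = b if k == a else a
                final[i] = final.get(i, 0.0) + alpha * initial_rsv[k]
    return final

# Fig. 2 example: Document 11 scores 0.8; with link strength 0.3 between
# Nodes 11 and 7, Document 7 gains 0.3 * 0.8 = 0.24.
rsv = rerank({11: 0.8, 7: 0.5}, {(11, 7): 0.3})
print(rsv)
```

Here Document 7 rises from 0.5 to 0.74; since links are undirected, Document 11 also gains 0.3 * 0.5 from Document 7.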
To resolve this question, the following section presents the retrieval effectiveness of our learning scheme and a comparison with probabilistic retrieval models.
4. RETRIEVAL EVALUATION

Although other learning strategies have already been proposed and evaluated, most of them are directly related to the probabilistic retrieval model or have some relationship with it (i.e., the neural network approach described in Kwok, 1990a). From a different perspective, Gordon (1988) suggests a learning scheme based on a genetic algorithm, which enhances retrieval effectiveness. In this approach, an iterative process affects document surrogates by including or removing index terms based on (1) alternative descriptions of each document; and (2) the requests and relevance judgments. These competing document descriptions are obtained using various indexing policies (i.e., based on document abstracts, on titles, using full text, or derived from a manual indexing process). The aim of this section is to explain the main characteristics of the probabilistic retrieval models with which our learning approach will be compared. Thus, the first section presents the basic probabilistic retrieval model. In the second, the term significance model of Croft (1983) is outlined. Since the query formulations are radically dissimilar in our scheme and in these probabilistic models, the third section explains by an example the main differences and their implications. The evaluation tests are described in the fourth
Fig. 2. Retrieval of information by following relevance links.
section, and the last exposes the retrieval effectiveness of our learning scheme and compares it with different probabilistic approaches. We must first define the decision rule used when comparing two retrieval strategies, a judgment which must be based on objective measures. In this paper, our evaluation is essentially limited to recall and precision measurement. Recall is defined as the ratio of the number of retrieved relevant documents to the total number of relevant documents in the collection. Precision is defined as the ratio between the number of retrieved and relevant documents and the number of retrieved items. These values are well known as measures of the capability of a system to select relevant information and to reject nonrelevant documents. For each query, we build a precision-recall curve and compute the average precision at ten standard recall values (van Rijsbergen, 1979, Chapter 7). To define this curve in our study, we compute a precision value for each new relevant document retrieved, and these precision values are then interpolated to obtain precision values at ten standard recall points. For our evaluations, we have chosen the neo-Cleverdon interpolation method (Williamson et al., 1971, p. 43). To decide whether one search strategy is better than another, the following rule of thumb is used: a difference of at least 5% in average precision is generally considered significant, and a 10% difference is considered very significant (Sparck Jones & Bates, 1977, p. A25). This decision rule may be complemented with two nonparametric tests, the Wilcoxon matched-pairs test and the sign test (Siegel, 1956, pp. 68-83). However, these two tests imply important assumptions not met in our current context:

[The Wilcoxon matched-pairs] test is done on the differences D_i = Z_1(Q_i) - Z_2(Q_i), but it is assumed that D_i is continuous and that it is derived from a symmetric distribution, neither of which is normally met in IR data. . . . [The sign test] makes no assumptions about the form of the underlying distribution. It does, however, assume that the data are derived from a continuous variable and that the Z(Q_i) are statistically independent. These two conditions are unlikely to be met in a retrieval experiment. Nevertheless, given that some of the conditions are not met, it can be used conservatively. (van Rijsbergen, 1979, pp. 178-179)
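For concreteness, the two basic measures used throughout this evaluation can be sketched as follows (the toy retrieval run below is hypothetical):

```python
def recall_precision(retrieved, relevant):
    """Recall: retrieved relevant over all relevant documents;
    precision: retrieved relevant over all retrieved items."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

# Hypothetical run: 4 of the 10 retrieved documents are among the 8 relevant ones.
r, p = recall_precision(retrieved=range(1, 11),
                        relevant=[2, 4, 6, 8, 20, 21, 22, 23])
print(r, p)
```

This run yields a recall of 0.5 and a precision of 0.4; the precision-recall curve is obtained by recomputing these values at each new relevant document in the ranked list.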
For example, when comparing the average precision obtained with the traditional Boolean model and the p-norm scheme, we can conclude that the latter performs significantly better (see Table 4). In Table 4, the results of the traditional Boolean model form the baseline from which the percentages of change are computed. The best (p) value seems
Table 4. Evaluation of Boolean and p-norm retrieval models

                                                Precision (% change)
Model                                     CACM collection     CISI collection
                                          (50 queries)        (35 queries)
Classical Boolean model,
  sorted by publication date              23.2                22.5
p-norm model, p-value = 1                 36.0 (+55.4)        38.9 (+72.9)
p-norm model, p-value = 2                 35.5 (+53.1)        36.6 (+62.4)
p-norm model, p-value = 5                 36.4 (+56.8)        35.7 (+58.6)
p-norm model, p-value = 10                38.0 (+64.0)        35.3 (+57.0)
p-norm model, p-value = 12                38.5 (+66.2)        35.3 (+56.8)
p-norm model, p-value = 15                38.1 (+64.5)        34.6 (+53.6)
p-norm model, p-value = 30                36.6 (+58.0)        33.7 (+49.8)
p-norm model, p-value = ∞                 24.5 (+5.5)         28.9 (+28.2)
5’2
J.
SAVOY
to be 12 for the CACM collection and 1 for the CISI corpus. These values are retained in the subsequent evaluations of our learning scheme. However, for the CACM collection, setting the value of (p) in the range [10-15] does not significantly improve retrieval effectiveness over a baseline solution within which the p-value is equal to 12. The results of the p-norm model when the parameter (p) is set to infinity are not identical to those of the traditional Boolean model (first line of Table 4). In fact, the classical Boolean model is based on binary index terms, whereas weighted index terms are used when evaluating the p-norm model.
4.1 The basic probabilistic retrieval model
In order to define a retrieval model that may outperform the traditional Boolean model and is based on theoretical grounds, various attempts have been made to define a probabilistic retrieval scheme. Such a retrieval model must explain how documents and queries are represented and how these representations are compared to produce a ranked list of retrieved items. The main ideas underlying this approach are given below. The basic probabilistic retrieval model assumes that documents are represented by binary index terms (Robertson & Sparck Jones, 1976; Croft & Harper, 1979; van Rijsbergen, 1979, Chapter 6). This basic scheme is also based on the independence assumption, which states that index terms occur independently in document representatives. Under this hypothesis, knowing that the keyword "computer" appears in a text does not give us further information about the probability of occurrence of the word "IBM" or "Macintosh." Much work has been done in attempts to relax this stringent assumption (Sparck Jones, 1971), and particularly to account for first-order dependence between words (Maron & Kuhns, 1960; van Rijsbergen, 1979; Savoy, 1992). To compute the retrieval status of each document, the simplest strategy consists of ranking documents according to the coordination match expressed in eqn 4. In this case, both documents and requests are represented by binary index terms, and the retrieval status value of a document depends only on the number of terms in common with the current query:
    RSV(D_i) = Σ_{k=1..t} x_qk * x_ik    (4)
in which x_ik indicates the presence (x_ik = 1) or the absence (x_ik = 0) of the index term T_k in document D_i. The value x_qk has the value 1 or 0, denoting the presence or absence of the term T_k in the current request. We can simplify the computation of the coordination match not by considering all index keywords T_k, k = 1, 2, ..., t, but by restricting the summation to terms that actually occur in the query (for k = 1, 2, ..., q). It is recognized that when a document surrogate shares a sufficient number of keywords with the request, there is stronger evidence that this document is relevant (Sparck Jones, 1971, Chapter 2; Krovetz & Croft, 1992). Usually, however, few documents will respect this high-level matching. Moreover, in expressing an information need, the search terms can be weighted to reflect the importance attached by users to the various topics included in their requests. This ability to weight each query keyword is recognized as an important feature of a retrieval system: ". . . the characterization of queries is more important than that of documents; . . ." (Sparck Jones, 1981, p. 248) Moreover, each term does not represent the same discrimination power, and therefore a match on a narrow keyword must be treated as more valuable than a match on a common word. Equation 4 ignores this feature, for each query term has the same importance.

In normal term co-ordination matches, if a request and document have a frequent term in common, this counts for as much as a non-frequent one; so if a request and document share three common terms, the document is retrieved at the same level as another one sharing three rare terms with the request. (Sparck Jones, 1971, p. 17)
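A sketch of the coordination match (eqn 4) and of its idf-weighted refinement discussed next; the document-frequency figures below are invented for illustration:

```python
from math import log

def coordination_match(doc_terms, query_terms):
    """Eqn 4: the RSV is the number of index terms shared with the query."""
    return len(set(doc_terms) & set(query_terms))

def idf_match(doc_terms, query_terms, df, n):
    """Eqn 5: each matching query term contributes log(n / df_k) instead of 1,
    so rare (specific) terms count for more than frequent ones."""
    return sum(log(n / df[t]) for t in set(doc_terms) & set(query_terms))

# Hypothetical 1,000-document collection: "b" occurs in 10 docs, "c" in 100.
print(coordination_match(["a", "b", "c"], ["b", "c", "d"]))
print(round(idf_match(["a", "b", "c"], ["b", "c", "d"],
                      df={"b": 10, "c": 100}, n=1000), 3))
```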
Thus, we can incorporate query term weights by replacing the binary variable x_qk by w_qk, and calculate this weight using the idf computation, as suggested by Sparck Jones (1972). The retrieval status value of each document is computed as follows:
    RSV(D_i) = Σ_{k=1..q} x_ik * w_qk = Σ_{k=1..q} x_ik * log[n / df_k],   with   w_qk = log[n / df_k],    (5)
in which n represents the number of documents in the collection, and df_k the number of documents in which the keyword T_k occurs. This first attempt, based on the relative specificity of each query term, is rather ad hoc. A more formal derivation of w_qk is obtained in the probabilistic retrieval model by making use of Bayes' theorem and the term-independence assumption, postulating that the index terms occur independently in the relevant and nonrelevant documents (for details see van Rijsbergen, 1979, Chapter 6). In this case, the weight w_qk is evaluated using eqn 6:
    w_qk = log[ r_qk / (1 - r_qk) ] + log[ (1 - s_qk) / s_qk ],    (6)
in which r_qk (s_qk) expresses the conditional probability that a document representative contains the index term T_k, given that the document is relevant (nonrelevant). By using eqn 6 to evaluate the coefficient w_qk in eqn 5, we obtain:
    RSV(D_i) = Σ_{k=1..q} x_ik * w_qk = Σ_{k=1..q} x_ik * ( log[ r_qk / (1 - r_qk) ] + log[ (1 - s_qk) / s_qk ] )    (7)
where the value w_qk does not represent a probability, but the weight or coefficient assigned to query term T_k. This formula is computed over all overlapping keywords between the request and the document D_i. Depending on the values assigned to the probabilities r_qk and s_qk, the retrieval status value of a document may be negative. In this case, the document is not returned to the user. However, this probabilistic model does not provide a means of estimating the values r_qk and s_qk when no relevant documents are known. If no relevance information is available (Croft & Harper, 1979), we could assume that all query terms have equal probabilities of occurring in the relevant documents. Therefore, we may consider the probability r_qk as a constant (i.e., 0.5). Kwok (1990b) suggests estimating this probability with the value 0.025, because the set of relevant documents is usually small compared with the collection size. The probability s_qk could be estimated by the ratio df_k/n, which is similar to the idf component (see eqn 1). With these estimations, the previous formulation can be simplified, giving the following equation:
    RSV(D_i) = Σ_{k=1..q} x_ik * w_qk = Σ_{k=1..q} x_ik * ( log[ (n - df_k) / df_k ] + c ),   with   c = log[ r_qk / (1 - r_qk) ],    (8)
in which both c and r_qk are constants. Once relevance information is known by the system, we may build for each query term a contingency table, as shown in Table 5. This table contains the number of relevant documents for a given query (noted R_q) and, after inspecting the document representatives,
Table 5. Estimation of the relevance weighting of search term T_k

            Relevant                Nonrelevant                       Total
x_k = 1     rel_k + 0.5             df_k - rel_k + 0.5                df_k + 1
x_k = 0     R_q - rel_k + 0.5       n - df_k - R_q + rel_k + 0.5      n - df_k + 1
Total       R_q + 1                 n - R_q + 1                       n + 2
we may split the number of relevant documents into two sets, denoting whether or not the keyword T_k is included in the document surrogate (x_k = 1 or x_k = 0). Thus, the value rel_k indicates the number of relevant documents including the term T_k in their surrogate. Based on such tables, Robertson & Sparck Jones (1976) explain how to introduce relevance feedback information in order to estimate more accurately the probability values r_qk and s_qk (eqn 9):
    r_qk = (rel_k + 0.5) / (R_q + 1)   and   s_qk = (df_k - rel_k + 0.5) / (n - R_q + 1)    (9)
When estimating a probability, we usually modify the simple ratio by introducing a constant 0.5 in the numerator and 1 in the denominator. This correction is made in order to obtain a more realistic value when faced with unrealistically small samples. For example, when R_q is one, the simple ratio returns an estimated probability of one or zero, depending on whether rel_k is one or nil. However, when no relevant document indexed by keyword T_k can be found (rel_k = 0), the previous formula returns a rather bad estimate of the probability r_qk. In this case, we prefer setting the corresponding probability r_qk to 0.01 (Croft, 1983). Using the previous estimations, the computation of w_qk is given in the following equation:

    w_qk = log[ ( (rel_k + 0.5) / (R_q - rel_k + 0.5) ) * ( (n - df_k - R_q + rel_k + 0.5) / (df_k - rel_k + 0.5) ) ]    (10)
in which the weighting coefficient w_qk, known as the relevance weight (Robertson & Sparck Jones, 1976), measures the discrimination power of the term T_k, or the extent to which this keyword discriminates between relevant and nonrelevant documents. The probabilistic approach described so far considers only weighted search keywords and binary index terms. We may also weight index terms to obtain a more precise description of documents. The introduction of such coefficients is described in the following section.
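A sketch of this relevance-weight estimation (eqns 9 and 10), using the natural logarithm as the later tables do; the 0.01 adjustment for rel_k = 0 follows the text, and the function name is ours:

```python
from math import log

def relevance_weight(rel, df, R, n, r_min=0.01):
    """Estimate r_qk and s_qk from the Table 5 contingency counts (eqn 9)
    and derive the relevance weight w_qk (eqn 6 / eqn 10)."""
    r = (rel + 0.5) / (R + 1)
    if rel == 0:
        r = r_min  # adjustment when no relevant document contains T_k
    s = (df - rel + 0.5) / (n - R + 1)
    return log(r / (1 - r)) + log((1 - s) / s)

# "tss" in Table 7: rel_k = 1, df_k = 1, R_q = 5, n = 3204.
print(round(relevance_weight(1, 1, 5, 3204), 3))
```

This reproduces the Table 7 values: about 7.665 for "tss", and a negative weight (about -0.16) for "deal", whose presence is taken as evidence of nonrelevance.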
4.2 Term significance weight In order to improve the basic probabilistic model, Croft (1983) suggests accounting for the presence of weighted index terms in document surrogates. This weight, known as term significance weight, reflects how important a keyword is in describing the semantic content of a document. In this model, the retrieval status value is computed according to the following formula:
    RSV(D_i) = Σ_{k=1..q} P(x_ik = 1 | D_i) * w_qk,

    with   P(x_ik = 1 | D_i) =  K + (1 - K) * ntf_ik   if ntf_ik > 0;
                                0                      otherwise,    (11)
where w_qk is described in eqn 6, and ntf_ik in eqn 2. The constant K depends on collection characteristics, and it is included for the following reason:

It must be remembered that term occurrence in a document is a rare event in that only very few terms out of a large possible vocabulary occur in any given document. This implies that any non-zero significance weight should be given a reasonably high estimate for the probability of assignment. The constant K is introduced, therefore, to give higher estimates for these probabilities. (Croft, 1983, p. 7)
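A sketch of the term significance probability of eqn 11, checked against the "tss" entry of Table 6 (K = 0.3); the function name is ours:

```python
from math import log

def significance(ntf, K=0.3):
    """P(x_ik = 1 | D_i) from eqn 11: K + (1 - K) * ntf_ik when the term
    occurs in the document, and 0 otherwise."""
    return K + (1 - K) * ntf if ntf > 0 else 0.0

# "tss" in Table 6: w_qk = ln(3203/1) = 8.072, ntf_ik = 0.167 in Document 1410.
w_qk = log(3203 / 1)
contribution = significance(0.167) * w_qk
print(round(significance(0.167), 3), round(contribution, 3))
```

With ntf = 0.167, the probability is 0.417 and the term's contribution to the RSV becomes about 3.365 instead of the full 8.072, illustrating how the significance weight attenuates a matching keyword.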
In order to understand more clearly the computation principles of the probabilistic retrieval models, Table 6 presents an example where no relevance information is available. The query term weight w_qk represents the value obtained by eqn 8. As shown in Table 6, this coefficient, when attached to a specific word, is higher than that assigned to a frequent keyword ("tss" vs. "system"). The second part of this table shows the influence of the weighted index terms. For example, for Document 25, both eqns 8 and 11 return the same value. Generally, however, the normalized term frequency ntf_ik depicts a value other than one. Thus, the term significance weight attenuates the contribution of this matching keyword in computing the retrieval status value of a document. When full relevance information is known, the computation of the same example is presented in Table 7. In these circumstances, both probabilities r_qk and s_qk are evaluated using eqn 9. In this example, the value of n equals 3,204, R_q is 5, and the relevant documents attached to the underlying request include Document 1410. Since the search term "deal" is not included in any relevant document, the value rel_k equals zero. In this case, our adjustment technique implies that the underlying probability r_qk is set to 0.01 (instead of 0.5/6 = 0.0833...). The resulting relevance weight is negative, denoting that the presence of the index term "deal" is considered as evidence that the given document is nonrelevant. Of course, if the retrieval status of a document becomes negative, this document is not included in the ranked output list. The coefficients w_qk used by Robertson & Sparck Jones (first part of Table 7) show that the keyword "tss" has a greater importance in the user's query than the term "system," and this variation favors relevant Document 1410.
In Croft's model, this difference is attenuated by the indexing weights, and the impact of the keyword "tss" in relevant Document 1410 is more or less equal to the importance attached to the keyword "system" for Document 25 or 1071. In the previous examples, the parameter K seems to be set arbitrarily to 0.3. In Croft's paper (1983), the author suggests using K = 0.5 as the default value when no tuning is possible. To define the value of this parameter more precisely, we have evaluated the retrieval performance using various values for K, as shown in Table 8. For the CACM collection, the best K value seems to be 0.3. However, setting this parameter in the range [0.2-0.4] does not significantly modify the retrieval effectiveness over the results based on K = 0.3. Moreover, when K = 0, eqn 11 still takes account of weighted index terms, and thus this model does not reduce to eqn 8. When analyzing the results shown in Table 8, we can conclude that weighting a query
Table 6. Initial term weighting (K = 0.3)

Stem Tk      dfk     rqk     (n - dfk)/dfk         wqk
tss            1     0.5     3203/1   = 3203       8.072
deal          37     0.5     3167/37  = 85.59      4.450
system       675     0.5     2529/675 = 3.75       1.321

Stem Tk     Document i     ntfik     K + (1 - K)·ntfik     wik
tss         i = 1410       0.167     0.417                 3.365
deal        i = 1890       0.25      0.475                 2.114
system      i = 25         1.0       1.0                   1.321
            i = 1071       0.8       0.86                  1.136
Table 7. Full relevance weighting (K = 0.3)

Stem Tk     dfk     relqk     rqk               sqk                     wqk
tss           1       1       1.5/6 = 0.25      0.5/3200   = 0.00015     7.665
deal         37       0       0.01              37.5/3200  = 0.0117     -0.160
system      675       5       5.5/6 = 0.9166    670.5/3200 = 0.2095      3.726

Stem Tk     Document i     ntfik     K + (1 - K)·ntfik     wik
tss         i = 1410       0.167     0.417                  3.196
deal        i = 1890       0.25      0.475                 -0.076
system      i = 25         1.0       1.0                    3.726
            i = 1071       0.8       0.86                   3.204
term using the idf formula (eqn 5) or eqn 8 (with rqk ≥ 0.1) produces significant enhancement over the simple coordination match. These results confirm the conclusions of Croft's study (1983). For both collections, probabilistic models based on eqn 5 or 8 yield nearly identical results. However, our results are based on full-text indexing, and the study of Fuhr & Müller (1987) shows that when using a controlled vocabulary, the idf approach does not improve the retrieval effectiveness over a simple coordination match. In our study, the term significance model of Croft outperforms these three probabilistic schemes for both collections. According to Table 8, the optimal setting seems to be rqk = 0.4, K = 0.3 for the CACM collection, and rqk = 0.7, K = 0.3 for the CISI corpus.

4.3 Query formulation

Before comparing our learning scheme with probabilistic models, the following differences must be noted. Firstly, the evaluation of both retrieval techniques is not based on
Table 8. Evaluation of probabilistic models without relevance information

                                                Precision (% change)
Model                                      CACM collection    CISI collection
                                           (50 queries)       (35 queries)
Probabilistic model (eqn 4),
  coordination level match                 22.6               25.8
Probabilistic model (eqn 5) (wqk = idfk)   28.1 (+24.7)       30.1 (+16.8)
Probabilistic model (eqn 8)
  (rqk = 0.025, sqk = dfk/n)                9.7 (-56.9)        5.1 (-80.1)
  (rqk = 0.3,   sqk = dfk/n)               28.8 (+27.7)       25.7 (-0.3)
  (rqk = 0.4,   sqk = dfk/n)               29.3 (+29.9)       28.3 (+9.7)
  (rqk = 0.5,   sqk = dfk/n)               28.1 (+24.5)       29.7 (+15.2)
  (rqk = 0.7,   sqk = dfk/n)               27.4 (+21.3)       30.1 (+16.4)
  (rqk = 0.8,   sqk = dfk/n)               27.1 (+19.9)       29.7 (+15.0)
Croft's probabilistic model (eqn 11)
  (rqk = 0.5, K = 0.0)                     30.6 (+35.7)       32.6 (+26.4)
  (rqk = 0.5, K = 0.3)                     35.3 (+56.3)       34.6 (+34.2)
  (rqk = 0.5, K = 0.5)                     33.5 (+48.5)       34.2 (+32.4)
  (rqk = 0.5, K = 0.7)                     31.3 (+38.7)       32.0 (+24.0)
  (rqk = 0.4, K = 0.3)                     35.3 (+56.5)       32.9 (+27.5)
  (rqk = 0.6, K = 0.3)                     35.1 (+55.5)       35.6 (+37.9)
  (rqk = 0.7, K = 0.3)                     34.9 (+54.7)       36.1 (+40.0)
  (rqk = 0.9, K = 0.3)                     34.0 (+50.5)       35.3 (+36.9)
Table 9. Example of user's information need expressed in Boolean and natural language form

Boolean query
   ('tss' OR ('ibm' AND ('time', 'sharing')))

Natural language query
   "What articles exist which deal with TSS (Time Sharing System), an operating system of IBM computers?"

Natural language query viewed by the system
   stem       tfqk    dfk        stem       tfqk    dfk
   articl       1      20        share        1      98
   exist        1      90        system       2     675
   deal         1      37        oper         1     331
   tss          1       1        ibm          1      95
   time         1     412        comput       1     855
an identical query form. Our approach requires Boolean requests, whereas probabilistic models demand natural language queries. As shown in Table 9, the same user's information need does not contain equivalent information in both forms; for example, the notion of "operating system" is not listed in the Boolean form. To be more precise, search terms are not manipulated by the retrieval scheme as written by the user, because their suffixes are removed by a suffix-stripping algorithm (i.e., "sharing" is transformed into "share" in Table 9). The aim of this procedure is to merge or "conflate" semantically equivalent words to the same form and to keep semantically distinct words separate. However, a stemming algorithm cannot capture all morphological variations ("related" is reduced to "relate" but "relatedness" gives the stem "related") and introduces sense mismatches ("operating" and "operation" are both reduced to the stem "oper"). Secondly, natural language queries contain more search terms, including terms that occur frequently in the collection (i.e., "comput" or "system"). Since a document is retrieved as soon as its representative contains at least one keyword in common with the query, the result list of the probabilistic model will be larger than that of the Boolean model (see Table 10). Including more search terms is not always advantageous.

   There will be some terms which, whatever the original intention, retrieve a large number of documents, of which only a small proportion can be expected to be relevant to a request. Such terms are on the whole more of a nuisance than rare, over-specific terms which fail to retrieve documents. (Sparck Jones, 1972, p. 14)
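The paper does not specify which suffix-stripping algorithm it uses; the toy conflation procedure below (the suffix list and the 'e'-restoration rule are our own illustrative choices, not the paper's) reproduces the behavior described above, including its failures.

```python
# Toy suffix stripper for illustration only -- NOT the algorithm used in
# the paper; a real system would use a full stemmer (e.g., Porter's).
SUFFIXES = sorted(["ation", "ating", "ness", "ing", "ed", "es", "s"],
                  key=len, reverse=True)
VOWELS = set("aeiou")

def stem(word):
    word = word.lower()
    for suf in SUFFIXES:
        root = word[: len(word) - len(suf)]
        if word.endswith(suf) and len(root) >= 3:
            # crude 'e'-restoration after -ing/-ed, so that
            # "sharing" conflates to "share" rather than "shar"
            if (suf in ("ing", "ed") and root[-1] not in VOWELS
                    and root[-2] in VOWELS and root[-3] not in VOWELS):
                root += "e"
            return root
    return word
```

This sketch conflates "sharing" to "share" and both "operating" and "operation" to the stem "oper" (the sense mismatch noted above), while "related" becomes "relate" but "relatedness" only reaches "related" (the missed morphological variation).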
Table 10. Statistics of retrieval models (CACM and CISI collections)

Statistics                            Boolean model    p-norm model    Probabilistic model
CACM collection
  number of queries                        50               50               50
  # of queries without result list          7                0                0
  # of q. with null relevant doc.          14                0                0
  mean number of retrieved doc.         17.86           582.48           936.08
CISI collection
  number of queries                        35               35               35
  # of queries without result list          0                0                0
  # of q. with null relevant doc.           1                0                0
  mean number of retrieved doc.         159.0           888.29           903.49
Thirdly, whereas the CACM collection includes the same user's information needs in both Boolean and natural language expressions, the CISI corpus contains 35 Boolean queries and 76 natural language requests. In order to compare both retrieval systems more objectively, our evaluation for the CISI collection is based on 35 requests. Thus, direct comparisons with results reported in other works, such as Kwok (1990b), are not possible. Moreover, other studies may have used different suffix-stripping algorithms, distinct stop lists, and dissimilar interpolation methods when computing precision-recall tables. The effect of including more search terms is depicted in Table 10, which indicates that the matching process of the traditional Boolean model is more selective than both the p-norm and the probabilistic schemes (CACM: 17.86 retrieved documents, CISI: 159 retrieved documents). Since the Boolean operators of the p-norm model are interpreted less strictly, the size of the result list is larger than with the classical Boolean model. Moreover, natural language requests contain more search terms (see Table 9), and produce a larger ranked list. Based on these statistics, we can see that individual keywords may have powerful extractive abilities without being very selective, although they introduce many false drops. Sparck Jones (1972) shows that the mean number of matching terms for relevant documents retrieved is higher than that for nonrelevant items. Thus, a user interested in obtaining only one relevant item may inspect only the first few documents of the ranked result list. However, such a search is more the exception than the norm. Another study confirms these phenomena:

   We believe that resolving word senses will have the greatest impact on a search that requires a high level of recall. This is because such searches retrieve many documents that have only one word in common with the query. Lexical ambiguity is not a significant problem in documents that have a large number of words in common with a query. (Krovetz & Croft, 1992, p. 139)
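The softened interpretation of the p-norm operators can be made concrete. The sketch below is a standard formulation of the Salton, Fox & Wu operators (equal operator weights assumed; the example document and term weights are hypothetical): a document missing one conjunct still obtains a positive retrieval status value under AND, which is why the p-norm result list is larger than the strict Boolean one, and as p grows both operators converge toward the strict interpretation.

```python
def pnorm_or(weights, p):
    # OR_p: ((w1^p + ... + wn^p) / n)^(1/p)
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def pnorm_and(weights, p):
    # AND_p: 1 - (((1-w1)^p + ... + (1-wn)^p) / n)^(1/p)
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

# A document indexed under 'time' (weight 0.8) but not 'sharing' (0.0)
# is rejected by a strict Boolean AND, yet keeps a positive p-norm score:
soft = pnorm_and([0.8, 0.0], p=2)     # about 0.28: the document is ranked
strict = pnorm_and([0.8, 0.0], p=64)  # near 0: almost the strict Boolean AND
```

With p = 1 the operators collapse to a vector-like average, which matches the paper's use of p = 1 for the CISI collection.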
The strong selectivity of the traditional Boolean model does, however, have drawbacks. Using this approach with the CACM collection, seven requests do not generate a result list, and seven additional queries retrieve no relevant documents. In both the p-norm scheme and the probabilistic models, each query generates a result list that contains at least one relevant document. The previous statistics are of limited value when comparing different retrieval techniques. We are more interested in measuring the influence of the learning technique over a solution ignoring relevance information, and in comparing probabilistic retrieval models with our proposed learning scheme.

4.4 Retrieval evaluation tests
Generally, three different tests can be conducted when evaluating a learning scheme. Firstly, we assume that all relevant documents to every query are known (retrospective test or full relevance feedback). This information is used either to estimate the probabilities or to establish relevance links. The results obtained under these circumstances represent an upper bound of retrieval effectiveness. This can be achieved when a large number of experiments have been done and the content of the nodes is static. This first evaluation may be viewed as being of academic interest only. . . . I shall assume that the statistics relating to the relevant and nonrelevant documents are available and I shall use them to build up the pertinent equations. However, at all times the reader should be aware of the fact that in any practical situation the relevance information must be guessed at (or estimated). (van Rijsbergen, 1979, p. 114)
However, this relevance information may be used to weight the probabilities rqk and sqk more accurately, or to establish various relationships between documents before the system is delivered (i.e., on CD-ROM) (Kwok, 1990b, pp. 382-383). A more realistic assumption is to employ the relevance judgments obtained for the ten best-ranked documents using the initial retrieval scheme (no relevance information
available). This partial information forms a sample from which relevance links are established or relevance weights may be computed. Based on this knowledge, two further tests are conducted. In the "ten repetitive" test, the best ten documents are retained during evaluation. The queries with no relevant documents in the first ten nodes are excluded. A second test, called "ten predictive," is based on the relevance information given by the previous best ten nodes. However, these ten documents are removed from the evaluation, as are queries with zero or all relevant documents in the best ten. The results obtained under these assumptions can be interpreted as a residual evaluation. These three tests do not represent all the evaluations that we can apply. Referring to the previous citation of van Rijsbergen, we may perform an initial search without prior information, and consider the documents at the top of the ranking list obtained by this initial search process as being relevant. Under this assumption, the best ranked documents have a higher probability of being relevant, whether in fact they are or not. Based on this information, we may make a guess by estimating the probabilities rqk and sqk or by establishing relationships among documents. Table 11 shows the results of Croft's probabilistic retrieval model when all relevance information is known by the system (retrospective test). From it, we can deduce that probabilistic learning presents a significant improvement over an approach not incorporating relevance information. Giving the parameter K a value in the range [0.3-0.7] for the CACM collection or [0.2-0.7] for the CISI corpus does not represent a significant difference over the optimum setting. Comparing these results with initial term weighting (Table 8), one can see that the best value for the parameter K changes from 0.3 to 0.5 for the CACM corpus, and from 0.3 to 0.4 for the CISI collection. The reason for this modification remains unknown, although this increase of the parameter K has already been observed in the study of Croft (1983). The results using partial relevance information obtained from the ten best ranked documents are also given. This learning sample of documents is obtained when using Croft's term significance weight without learning (rqk = 0.5, K = 0.5). The baseline solutions represent the probabilistic retrieval model without relevance information. Of course, both the ten repetitive and ten predictive tests show results significantly better than approaches that ignore learning.
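The two partial-feedback tests described above can be summarized in a short sketch (the function and its return convention are ours, not the paper's): the judgments collected on the ten best-ranked documents form the learning sample; the "ten repetitive" test keeps the sample in the evaluation, while the "ten predictive" (residual) test removes it.

```python
def partial_feedback_split(ranking, relevant):
    """Build the evaluation sets for the two partial-feedback tests.

    ranking  -- ordered document ids returned by the initial search
    relevant -- set of ids judged relevant for the query
    Returns (sample, repetitive, predictive); None marks a query
    excluded from that test.
    """
    sample = ranking[:10]
    found = [d for d in sample if d in relevant]
    # "ten repetitive": the sample is retained; queries with no relevant
    # document among the first ten are excluded
    repetitive = list(ranking) if found else None
    # "ten predictive": the sample is removed (residual evaluation);
    # queries with zero or all relevant documents in the sample are excluded
    if found and len(found) < len(relevant):
        predictive = [d for d in ranking if d not in sample]
    else:
        predictive = None
    return sample, repetitive, predictive
```

This makes the falling query counts in Tables 11 and 12 visible: each exclusion rule drops a few queries, so the retrospective, ten repetitive, and ten predictive rows are computed over different query sets.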
Table 11. Evaluation of Croft's probabilistic model with relevance feedback

                                                 Precision (% change)
Model                                       CACM collection    CISI collection
                                            (50 queries)       (35 queries)
Croft's probabilistic model (eqn 11)
CACM: rqk = 0.4, K = 0.3; CISI: rqk = 0.7, K = 0.3
  without learning                          35.3               36.1
  retrospective (K = 0.0)                   48.6 (+37.9)       45.7 (+26.5)
  retrospective (K = 0.1)                   51.3 (+45.5)       48.8 (+35.2)
  retrospective (K = 0.2)                   54.2 (+53.7)       51.3 (+42.2)
  retrospective (K = 0.3)                   57.2 (+62.2)       52.3 (+44.7)
  retrospective (K = 0.4)                   58.5 (+66.0)       53.4 (+47.7)
  retrospective (K = 0.5)                   58.9 (+67.0)       52.5 (+45.4)
  retrospective (K = 0.7)                   57.5 (+63.1)       51.2 (+41.8)
number of queries                           47                 33
CACM: rqk = 0.4, K = 0.3; CISI: rqk = 0.7, K = 0.3
  ten repetitive (without learning)         37.1               38.2
  ten repetitive                            47.7 (+28.8)       48.9 (+28.1)
number of queries                           44                 32
CACM: rqk = 0.4, K = 0.3; CISI: rqk = 0.7, K = 0.3
  ten predictive (without learning)         18.9               27.7
  ten predictive                            22.9 (+20.9)       38.3 (+38.4)
4.5 Evaluation of our learning scheme

In Table 12, the results of the p-norm model form the baseline from which the percentages of change are computed. For the CACM collection, the p-value is fixed to 12 for each Boolean operator (p-value = 1 when evaluating with the CISI collection; see Table 4). In these evaluations, partial relevance information has been drawn from the ten best ranked documents using the p-norm model. Based on full relevance feedback information, 8,876 relevance links have been established in the CACM collection (66,067 for the CISI corpus). Limited to the ten best ranked documents, this partial relevance feedback generates 324 relevance links (3.65% of the 8,876 links) for the CACM collection (296 links or 0.45% for the CISI corpus). Our first objective is to verify whether the relevance value associated with each relevance link represents a useful link semantic that might improve the retrieval effectiveness (see Section 3, Fig. 1). As an alternative hypothesis, we suggest using a fixed default value for each αik (i.e., αik = 0.3, see eqn 3). To normalize each αik between 0 and 1, we divide the relevance value of each link by the maximum relevance value for a given collection (see eqn 12).
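A simplified view of how relevance links enter the ranking is sketched below. The propagation rule is our own illustration, not the paper's eqn 3: each relevance link lets a fraction αik of a neighbor's retrieval status value flow to the linked document, with αik either fixed (e.g., 0.3) or obtained from the eqn-12 normalization of the link's relevance value.

```python
def normalize_alphas(relevance_values):
    # eqn 12: divide each link's relevance value by the maximum
    # relevance value observed in the collection
    top = max(relevance_values.values())
    return {link: v / top for link, v in relevance_values.items()}

def rerank(rsv, links, alpha=0.3):
    """Illustrative re-ranking with relevance links (a sketch, not the
    paper's exact combination formula).

    rsv   -- doc id -> retrieval status value from the initial search
    links -- (doc_a, doc_b) relevance links established from past
             requests and relevance judgments
    alpha -- link weight: fixed default or an eqn-12 normalized value
    """
    boosted = dict(rsv)
    for a, b in links:
        # a fraction of the partner's initial score is propagated
        boosted[a] = boosted.get(a, 0.0) + alpha * rsv.get(b, 0.0)
        boosted[b] = boosted.get(b, 0.0) + alpha * rsv.get(a, 0.0)
    return sorted(boosted, key=boosted.get, reverse=True)
```

A document that matches the query only loosely but is linked to a strongly matching node moves up the list, which is the intended effect; as observed below for the CISI corpus, the same mechanism can also push nonrelevant linked documents upward.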
Using a fixed default value for each relevance link (αik = 0.3) or a specific value for each link provides similar performance for both corpora in the retrospective test (CACM: 64.1 vs. 67.1 (+4.6), CISI: 51.3 vs. 52.8 (+2.9)). Similar conclusions can be drawn for the ten repetitive and ten predictive tests. However, using a fixed value for each relevance link allows an easier implementation. Concerning our learning scheme, Table 12 shows that for both the retrospective and ten repetitive tests, the retrieval performances are better after introducing relevance information than before. In the ten predictive test with the CACM collection, we observe a significant improvement over the baseline solution ignoring learning. However, for the CISI cor-
Table 12. Evaluation of our learning scheme

                                             Precision (% change)
Model (# of queries)                    CACM collection     CISI collection
p-norm model (50/35 queries)            38.5                38.9
  retrospective (αik = 0.1)             60.6 (+57.1)        52.6 (+35.0)
  retrospective (αik = 0.15)            63.3 (+64.3)        52.3 (+34.4)
  retrospective (αik = 0.2)             64.8 (+68.4)        52.5 (+34.7)
  retrospective (αik = 0.3)             64.1 (+66.3)        51.3 (+31.8)
  retrospective (αik = 0.4)             62.8 (+62.9)        49.2 (+26.4)
  retrospective (relevance value)       67.1 (+74.2)        52.8 (+35.5)
p-norm model (42/31 queries)
  ten repetitive (without learning)     44.5                43.5
  ten repetitive (αik = 0.1)            50.9 (+14.6)        48.9 (+12.3)
  ten repetitive (αik = 0.15)           52.2 (+17.3)        49.3 (+13.2)
  ten repetitive (αik = 0.2)            52.9 (+19.0)        49.4 (+13.4)
  ten repetitive (αik = 0.3)            53.8 (+20.9)        49.7 (+14.3)
  ten repetitive (αik = 0.4)            53.7 (+20.8)        49.3 (+13.3)
  ten repetitive (relevance value)      53.6 (+20.6)        49.7 (+14.1)
p-norm model (41/31 queries)
  ten predictive (without learning)     19.3                33.3
  ten predictive (αik = 0.1)            23.1 (+19.3)        33.4 (+0.2)
  ten predictive (αik = 0.15)           23.3 (+20.9)        32.7 (-1.8)
  ten predictive (αik = 0.2)            24.0 (+23.8)        32.4 (-2.6)
  ten predictive (αik = 0.3)            24.1 (+24.6)        31.8 (-4.5)
  ten predictive (αik = 0.4)            24.1 (+24.2)        31.2 (-6.4)
  ten predictive (relevance value)      23.9 (+23.3)        32.1 (-3.5)
pus, the ten predictive evaluation indicates that our learning scheme does not enhance the retrieval effectiveness over the baseline solution. After inspecting the queries of the CISI corpus, the following explanations may be formulated. In the predictive test, the relevance links modify slightly the ranking of relevant documents, but increase more significantly the position of nonrelevant items in the result list. When the value of the parameter αik increases, this effect is enlarged. Moreover, the partial relevance information for the CISI corpus is rather limited; 296 relevance links are inserted, which represent only 0.45% of the total. Compared to papers included in the CACM collection, CISI articles tend to be more general, and thus consist of many-faceted entities. Also, the requests of the CISI corpus tend to be more vague (i.e., "What is information science?"). Therefore, relevance links do not connect specific papers together, but rather establish relationships between loosely related papers. When comparing our learning scheme with Croft's term significance weight, the following conclusions can be drawn. For the retrospective test (αik = 0.2), which indicates the search method's potential for learning, we observe that for the CACM collection, our learning scheme seems to present significantly improved results (CACM: 64.8 vs. 58.9 (+9.1)). With the CISI corpus, however, both techniques return similar results (CISI: 52.5 vs. 53.4 (-1.7)). For the ten repetitive test, when comparing our proposed approach (αik = 0.3) with Croft's scheme (Table 11), the CACM collection seems to favor our scheme (CACM: 53.8 vs. 47.7 (+11.3)), whereas for the CISI corpus, both models return similar results (CISI: 49.7 vs. 48.9 (+1.6)). In the ten predictive test, our approach (αik = 0.3) presents a fairly significant improvement over the probabilistic approach for the CACM collection (24.1 vs. 22.9 (+5.0)). For the CISI corpus, however, the conclusion must be reversed (33.4 vs. 38.3 (-14.7)). Since the CISI test collection does not confirm the results obtained with the CACM collection, we cannot deduce that our learning algorithm is better than Croft's approach. Premature conclusions and judgments based on such a direct comparison must be taken with caution. As explained in Section 4.3, both approaches are based on different query formulations. Moreover, one can see that basic probabilistic retrieval models emphasize weighting search terms instead of considering relationships between documents. Thus, the two retrieval techniques do not operate at the same level of granularity. By analogy with physics, the probabilistic models lay stress on the components of a document; they operate at an atomic level, whereas our approach, considering words as ambiguous entities, works at a molecular level. Probabilistic schemes take account of both successes (retrieved relevant documents) and negative feedback. Thus, a search term like "deal" in Table 7, which does not appear in any relevant document, may be negatively weighted. Moreover, the probabilistic models outlined in this paper do not distinguish between a document and its surrogates, and treat each text as a single vector of index terms. In our proposed approach, the learning scheme is based only on positive feedback, and documents are viewed both as vectors of weighted index terms and in their relationships with other documents. Unfortunately, both our learning scheme and the probabilistic approaches are subject to difficulties in the presence of noisy information (i.e., incorrect relevance judgments).
5. CONCLUSION
This paper proposes a new learning algorithm to improve the retrieval effectiveness of the search system used in a hypertext environment. In this approach, the learning scheme is implemented using relevance links connecting nodes found relevant for a given request. During the retrieval process, these new links are taken into account to increase the similarity between nodes and query, and thus to modify the ranking of retrieved documents. Based on the CACM and CISI collections, the evaluation of the proposed scheme indicates that this approach is valid and demonstrates interesting performances compared with basic probabilistic retrieval models. These latter retrieval strategies place emphasis on weighting query terms, whereas our learning scheme establishes new relationships between relevant documents. However, other learning strategies have to be considered; for example, extensions of the probabilistic retrieval model described in Kwok (1990b), or the neural network approach (Kwok, 1990a). In this case, the underlying adjustment procedure is included to weight both search terms and index terms more accurately according to relevance information that can be drawn from previous requests. If, traditionally, learning schemes are used mainly with probabilistic retrieval models, our solution may be used with various Boolean models (p-norm, fuzzy set extension, hybrid Boolean strategies) or with the vector-processing scheme. Considering the commercial dominance of Boolean retrieval systems, in which significant investment has been made, our proposed solution presents an interesting perspective in such a context. By representing documents both by weighted index term vectors and by their relationships with others, our approach is well adapted for hypertext systems in which users may add or remove nodes of information. Whereas the current study focuses on relevance information to establish links between papers, other studies present retrieval schemes using relationships between articles such as nearest neighbor and citation (Turtle, 1990), author's name, bibliographic link, and computer review category (Fox et al., 1988), or bibliographic link, bibliographic coupling, co-citation, and nearest neighbor (Savoy, 1993b). However, these retrieval models ignore relevance information and do not incorporate a learning algorithm in order to enhance their performance.
Acknowledgements-This research was supported by the NSERC (Natural Sciences and Engineering Research Council of Canada) under grant OGP0090940, by the FCAR under grant 93-ER-1557, and by the SSHRC (Social Sciences and Humanities Research Council of Canada) under grant 410-92-1858. The author would like to thank the three anonymous referees for their helpful suggestions and remarks.
REFERENCES

Alschuler, L. (1989). Hand-crafted hypertext-Lessons from the ACM experiment. In E. Barrett (Ed.), The society of text: Hypertext, hypermedia, and the social construction of information (pp. 343-361). Cambridge, MA: The MIT Press.
Aron, J.D. (1983). The program development process: Part II. The programming team. Reading, MA: Addison-Wesley.
Blair, D.C. (1990). Language and representation in information retrieval. Amsterdam, Holland: Elsevier.
Bush, V. (1945). As we may think. Atlantic Monthly, 176(1), 101-108.
Croft, W.B., & Harper, D.J. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4), 285-295.
Croft, W.B. (1983). Experiments with representation in a document retrieval system. Information Technology: Research & Development, 2, 1-21.
Fox, E.A., Nunn, G.L., & Lee, W.C. (1988). Coefficients for combining concept classes in a collection. Proceedings of the 11th International SIGIR Conference (pp. 291-307). Grenoble, France.
Fox, C. (1990). A stop list for general text. SIGIR Forum, 24(1-2), 19-35.
Fuhr, N., & Müller, P. (1987). Probabilistic search term weighting-Some negative results. Proceedings of the 10th International SIGIR Conference (pp. 13-18). New Orleans, LA.
Furnas, G., Landauer, T.K., Gomez, L.M., & Dumais, S.T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30(11), 964-971.
Gordon, M. (1988). Probabilistic and genetic algorithms for document retrieval. Communications of the ACM, 31(10), 1208-1218.
Halasz, F.G. (1988). Reflections on NoteCards: Seven issues for the next generation of hypermedia systems. Communications of the ACM, 31(7), 836-852.
Krovetz, R., & Croft, W.B. (1992). Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10(2), 115-141.
Kwok, K.L. (1990a). Application of neural network to information retrieval. Proceedings of the International Joint Conference on Neural Networks, Volume II (pp. 623-626). Washington, D.C.
Kwok, K.L. (1990b). Experiments with a component theory of probabilistic information retrieval based on single terms as document components. ACM Transactions on Information Systems, 8(4), 363-386.
Maron, M.E., & Kuhns, J.L. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7(3), 216-244.
Nie, J. (1989). An information retrieval model based on modal logic. Information Processing & Management, 25(5), 477-491.
Robertson, S.E., & Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146.
van Rijsbergen, C.J. (1979). Information retrieval (2nd edition). London, UK: Butterworths.
Salton, G., Fox, E., & Wu, H. (1983). Extended Boolean information retrieval. Communications of the ACM, 26(12), 1022-1036.
Savoy, J. (1992). Bayesian inference networks and spreading activation in hypertext systems. Information Processing & Management, 28(3), 389-406.
Savoy, J. (1993a). Retrieval effectiveness of information retrieval systems used in a hypertext environment. Hypermedia, 5(1), (in press).
Savoy, J. (1993b). Ranking schemes in hybrid Boolean systems: A new approach. Technical report, Département d'informatique et de recherche opérationnelle, Université de Montréal, July 1993.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Sparck Jones, K. (1971). Automatic keyword classification for information retrieval. London: Butterworths.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.
Sparck Jones, K., & Bates, R.G. (1977). Research on automatic indexing 1974-1976. Technical report, Computer Laboratory, University of Cambridge.
Sparck Jones, K. (1981). Retrieval system tests 1958-1978. In K. Sparck Jones (Ed.), Information retrieval experiment (pp. 213-255). London: Butterworths.
Smith, M.E. (1990). Aspects of the p-norm model of information retrieval: Syntactic query generation, efficiency, and theoretical properties. Doctoral dissertation, Department of Computer Science, Cornell University.
Turtle, H. (1990). Inference networks for document retrieval. Doctoral dissertation, Computer and Information Science Department, University of Massachusetts. COINS Technical Report 90-92.
Williamson, D., Williamson, R., & Lesk, M. (1971). The Cornell implementation of the SMART system. In G. Salton (Ed.), The SMART retrieval system-Experiments in automatic document processing (pp. 12-54). Englewood Cliffs, NJ: Prentice-Hall.