Extended Boolean query processing in the generalized vector space model

Information Systems, Vol. 14, No. 1, pp. 47-63, 1989. 0306-4379/89 $3.00 + 0.00. Copyright © 1989 Pergamon Press plc. Printed in Great Britain. All rights reserved.

EXTENDED BOOLEAN QUERY PROCESSING IN THE GENERALIZED VECTOR SPACE MODEL

S. K. M. Wong¹, W. Ziarko¹, V. V. Raghavan²† and P. C. N. Wong¹

¹Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 and ²The Center for Advanced Computer Studies, University of Southwestern Louisiana, P.O. Box 44330, Lafayette, LA 70504, U.S.A.

(Received 24 October 1986; in revised form 20 June 1988)

Abstract—An information retrieval model, named the Generalized Vector Space Model (GVSM), is extended to handle situations where queries are specified as weighted Boolean expressions. It is shown that this unified model, unlike currently available alternatives, has the advantage of incorporating term correlations into the retrieval process. The query language extension is attractive in the sense that most of the algebraic properties of the strict Boolean language are still preserved. Although the experimental results for the proposed extended Boolean retrieval are not always better than those of the vector processing method, the developments here are significant in enabling commercially available retrieval systems to benefit from vector-based methods. It is shown that relevance feedback techniques can be employed in this extended Boolean environment and, for both document collections tested, significant improvements over the initial search are obtained after the modification of queries via feedback. The proposed scheme is compared to the p-norm model advanced by Salton and co-workers. An important conclusion is that it is desirable to investigate further extensions that can offer the benefits of both proposals.

1. INTRODUCTION

Information Retrieval (IR) is a discipline involved with the organization, structuring, analysis, storage, searching and dissemination of information. IR systems are designed with the objective of providing, in response to a user query, references to certain items from a given collection of information items, depending on the information desired by the user. In other words, the system suggests which documents the user should read in order to satisfy his (her) information requirements [1]. A document may or may not be judged relevant to a user query (i.e. satisfy the information need) depending on many variables relating to the document (e.g. its scope, how it is written) as well as numerous user characteristics (e.g. why the search is initiated, the user's previous knowledge). Since many factors may influence the judgement concerning relevance in a complex way, it is easy to see that designing an IR system within this frame of reference is very challenging. In fact, it is unrealistic to expect an IR system to be able to retrieve only and all relevant documents. Accordingly, appealing to utility theory, a number of IR researchers take the view that systems should facilitate the ranking of documents in descending order of their estimated usefulness to a user query [2, 3].

In order to achieve such a ranking, some method for representing what the documents are about (i.e. knowledge representation of documents) is needed. It is common, in IR, to suppose that each document is indexed by a set of content identifiers that are variously known as keywords, index terms, subject indicators, descriptors or concepts. This indexing process would require the application of some automatic or manual indexing technique to either the full texts or some surrogates of the documents in order to identify the index terms to be used in their representation. In addition to the selection of index terms to represent documents, it is also common to associate weights that reflect the importance of each term as an indicator of the content of the documents to which it is assigned. The user request may, on the other hand, be in the form of a natural language statement or a Boolean expression. In the former case the query may be represented within the IR system as a set of (index term, weight) pairs [1, 4]. The retrieval operation often begins with matching the sets of index terms assigned to the stored documents against the keywords representing the user query. The matching is followed by the retrieval of those documents whose set of content identifiers meets the conditions specified in the user query. The system can also obtain feedback from the user as to the appropriateness of the initial set of retrieved documents, and use it to obtain an improved query formulation. This kind of operation, known as relevance feedback, is particularly helpful in obtaining more effective retrieval output [1, 5, 6].

In the past, several mathematical models for document retrieval systems have been developed [1, 4, 7-11]. These models are used to formally represent the basic characteristics, functional components, and the retrieval processes of document retrieval systems. Two basic categories of models that have been employed in information retrieval are the vector processing models and the Boolean retrieval models.

†To whom all correspondence should be addressed.


In the conventional vector space model (VSM) proposed by Salton [1], the (index term, weight) pairs are used to provide a vector representation of the documents and the queries. To achieve this, vectors associated with index terms are assumed to constitute the basis set of a vector space. Then each document or query is represented as a linear combination of the term vectors in the basis. The retrieval operation consists of computing the cosine similarity function between a given query vector and the set of document vectors and then ranking the documents accordingly. In this approach, the interpretation is made that the occurrence frequency of a term in a document represents the component of the document vector along the term vector corresponding to the given term. The advantages of this model are that it is simple and yet powerful. The vector operations can be performed efficiently enough to handle very large collections. Furthermore, it has been shown that the retrieval effectiveness is significantly higher compared to that of the Boolean retrieval models [12-14]. However, this vector model has not been incorporated into commercial systems to any great extent.

In strict Boolean retrieval systems [1, 4] the user query normally consists of index terms that are connected by the Boolean operators AND, OR and NOT. The advantage of using Boolean connectives is that they enable the user query to possess greater structure than a query that specifies only a set of (index term, weight) pairs. In other words, the user can express his/her needs more flexibly. However, the expressive power of the query language notwithstanding, the search result is still not adequately discriminated. That is, the documents selected for retrieval are not ranked according to any order of presumed importance, and hence are deemed by the system to be equally relevant. This is a direct consequence of the fact that the document representation is binary, indicating only the presence or the absence of the various index terms. Under these circumstances, one can achieve some control over the output by varying the query specification in order to restrict or expand the search results. But, relative to the models that allow weighting of terms, the ability to control search output is still very limited [15]. We believe that this inability of strict Boolean systems to provide a finer distinction among retrieved documents is a major contributing factor in their poor performance vis-à-vis models based on weighted representation schemes.

One of the aims of research investigations in information retrieval has been to make the adoption of the vector processing models in commercial systems more attractive. A key difficulty in this regard is due to the inability of the vector processing systems to handle Boolean queries. In recent years some progress has been made in the integration of Boolean and vector processing systems [12-14, 16]. If more and better ways to achieve this are advanced, we will be better able to provide a setting in which existing Boolean systems can use vector processing

techniques without a great deal of additional cost and effort.

Another problem with the conventional vector space model is that it assumes that the term vectors are orthogonal. Since it is generally agreed that terms are correlated, it would be desirable to generalize the model to incorporate term correlations. In response to this need, a vector processing model termed the GVSM [17, 18] was proposed, in which the relationships between terms are accounted for, but the queries were still assumed to be represented as a list of terms and corresponding weights. That is, initially, no provision was made for processing Boolean queries in such an approach. However, the premises of the GVSM were found to lead naturally to a new scheme for handling Boolean queries. In this paper we present the details of this scheme. It is hoped that this result will be a step towards the aim of integrating vector processing capabilities into existing systems that use the standard Boolean retrieval model.

This paper is organized in the following manner. In Section 2 we review the main characteristics of the GVSM. In Section 3 its connection to the strict Boolean retrieval model is explained and the characterization of the strict Boolean retrieval system in the vector model environment is presented. This vector model for the strict Boolean retrieval system is then generalized to handle weighted queries and documents. In Section 4 the various ideas and models presented in the earlier sections are summarized. Then, in Section 5, the proposed scheme is compared with the p-norm model [12-14]. This section also presents experimental results which show that the proposed scheme is effective. In addition, some experimental results that demonstrate the feasibility of implementing relevance feedback, within the proposed model for extended Boolean retrieval, are presented. The final section offers some concluding remarks and areas for further research.

2. REVIEW OF THE GVSM

Our approach to integrating the vector space model of information retrieval with the Boolean retrieval model rests on establishing an analogy between elements of a Boolean algebra and elements of a vector space. This is done in subsections 2.1 and 2.2. The correspondence so established leads to a formal specification of documents and queries, as described in subsection 2.3.2. Subsequently, a way of thinking about the correspondence between orthogonality among vectors, on the one hand, and the lack of co-occurrence of the associated concepts, on the other hand, is developed. This development allows the explicit representation of terms as vectors in such a way that the degree of relatedness between terms can be computed as the scalar product of the corresponding term vectors. These details are presented in subsection 2.3.3. The example in subsection 2.3.4 consolidates the various ideas and clarifies the mechanics of the GVSM.

2.1. Vector representation of elements in a Boolean algebra

It is common, in defining a vector space, to specify a set of basis vectors and then to provide the operations through which the various elements of the space can be generated from the basis vectors. Thus, the first step in establishing a connection is to recall that any element of a Boolean algebra can be generated starting from a set of atomic expressions or minterms. Following that, we show how the elements and operations of Boolean algebra can be mapped to the elements and operations of a vector space.

Let x_1, x_2, ..., x_n be n literals used to generate the free Boolean algebra, denoted B_n. Any Boolean expression composed of these literals (using the operators AND, OR or NOT) is an element of the algebra. What we desire is to show that a vector space can be defined in which every Boolean expression in B_n is represented by a vector. In a vector space it is necessary to specify a set of vectors that forms a basis. Clearly, if a basis is known then any vector in the space can be expressed as a linear combination of the basis vectors. Since the intent is to obtain a vector representation for every possible Boolean expression, it is appropriate to have the basis vectors correspond to the set of fundamental expressions which can be combined to generate any element of the algebra. We, therefore, employ the notion of an atomic expression. An atomic expression, or a minterm, in the n literals x_1, x_2, ..., x_n is a conjunction of the literals in which each x_i appears exactly once, either in complemented or uncomplemented form. That is,

m_k = x_1^{e_1} AND x_2^{e_2} AND ... AND x_n^{e_n},

where e_i = 0 or e_i = 1, with x_i^0 = NOT x_i and x_i^1 = x_i.

It is well known that the conjunction of any two distinct minterms is always zero (false) and that any Boolean expression in the literals x_1, x_2, ..., x_n can be uniquely expressed as a disjunction of minterms. The representation obtained in this way is known as the disjunctive normal form [19]. Let {m_i} denote the set of 2^n minterms in B_n. In order to characterize a vector space in which these minterms correspond to the basis, we define a set of 2^n-dimensional vectors {m_i}. These vectors constitute an orthonormal basis of the vector space R^{2^n} as follows:

m_1 = (1, 0, 0, ..., 0)
m_2 = (0, 1, 0, ..., 0)
m_3 = (0, 0, 1, ..., 0)
...
m_{2^n} = (0, 0, 0, ..., 1).   (1)

Given these, it is easily seen that the vector representation of any Boolean expression is given by the vector sum of those vectors in the basis that correspond to the minterms in the disjunctive normal form of the expression. The assertion that, for any two vectors m_i, m_j, the scalar product m_i · m_j is zero corresponds to the fact that the conjunction of the atomic expressions m_i and m_j is false. In general, if two vectors are not orthogonal, then the corresponding Boolean expressions have at least one minterm in common.

2.2. Vector representation of terms with binary weights

The ideas developed in Section 2.1 can be applied to an information retrieval environment, where each index term can be given an explicit vector representation. Let the indexing vocabulary consist of n terms, and let t_1, t_2, ..., t_n denote the literals associated with these terms. Any literal can appear in a Boolean expression either as NOT t_i or t_i, depending on whether it needs to be complemented or not. In particular, the conjunctive expressions in which every literal appears in either uncomplemented or complemented form are the atomic expressions. Let {m_k} denote the set of all atomic expressions. Then, since each t_i is itself an element of the Boolean algebra generated, t_i can be expressed in its disjunctive normal form:

t_i = m_{i1} OR m_{i2} OR ... OR m_{ir},   (2)

where the m_{ij}s are those minterms in which t_i is uncomplemented. Let the set of minterms in equation (2) be denoted by {m}^i. We can now define the vectors in the basis set analogous to equation (1), and the term t_i can be written in vector notation as

t_i = Σ_{m_k ∈ {m}^i} m_k.   (3)

Alternatively,

t_i = Σ_{k=1}^{2^n} c_{ik} m_k,   (4)

where

c_{ik} = 1 if m_k ∈ {m}^i, and c_{ik} = 0 otherwise.

That is, the term vectors are a linear combination of the m_k s of the basis, and the vector summation operations in equation (4) correspond to the OR operations of equation (2). Furthermore, the scalar product between any two distinct vectors in the basis is zero, corresponding to the fact that the ANDing of two distinct minterms is false.

2.3. The generalized vector space model (GVSM)
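As an illustration of equations (1)-(4), the following Python sketch (the function names are ours, not the paper's) enumerates the 2^n minterms for n = 3 literals and builds each term vector as a 0/1 sum of basis vectors, one per minterm in the term's disjunctive normal form:

```python
import itertools

n = 3  # number of literals t_1..t_n

# Enumerate all 2^n minterms as 0/1 tuples: position i records whether
# literal t_{i+1} appears uncomplemented (1) or complemented (0).
minterms = list(itertools.product([0, 1], repeat=n))

def term_vector(i):
    """Equations (3)/(4): t_i is the sum of the basis vectors m_k for the
    minterms in which t_i appears uncomplemented (each c_ik is 0 or 1)."""
    return [1 if m[i] == 1 else 0 for m in minterms]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

t1, t2 = term_vector(0), term_vector(1)
# A term's DNF covers exactly half of the 2^n minterms.
assert sum(t1) == 2 ** (n - 1)
# t1 . t2 counts the minterms shared by the DNFs of t1 and t2: those in
# which both literals are uncomplemented, i.e. 2^(n-2) of them.
assert dot(t1, t2) == 2 ** (n - 2)
```

With binary weights the scalar product of two term vectors simply counts shared minterms, which is the bridge to the correlation measures of Section 2.3.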

In this section we will review the essential features of the GVSM [17, 18]. This model is the result of incorporating the ideas developed in Section 2.2 into the framework of the conventional vector space

model. One of the main steps in this process involves the generalization of the term vector representation in such a way that the expansion coefficients in equation (4) are not necessarily binary numbers. The determination of these coefficients is, however, closely tied in with the question of what is meant by two terms being non-orthogonal (or, correlated). This is because once the coefficients are specified, the scalar product between any two non-binary vectors t_i and t_j is defined. Since a scalar product of zero implies orthogonality, a non-zero value must represent a measure of non-orthogonality. In order to motivate the premises of the GVSM and to introduce some basic concepts, we first outline the main ideas of the conventional vector space model in the following section.

2.3.1. The conventional vector space model. The basic premise in the conventional vector space model (VSM) is that the documents and a query are represented, respectively, by a set of p vectors, {d_α | α = 1, 2, ..., p}, and q in a vector space spanned by the set of n normalized term vectors, {t_i | i = 1, 2, ..., n}. That is,

d_α = Σ_{i=1}^{n} a_{αi} t_i,  (α = 1, 2, ..., p),   (5)

q = Σ_{j=1}^{n} q_j t_j.   (6)

Given the above representation, the scalar product d_α · q (the cosine of the angle between d_α and q, if both vectors are normalized), which may serve as a measure of the similarity between each document in {d_α} and the query q, is defined by

d_α · q = Σ_{i,j=1}^{n} a_{αi} q_j (t_i · t_j),  (α = 1, 2, ..., p).   (7)

It is assumed that the higher the similarity value of a document to a query, the greater is its potential to be relevant to the user query. The retrieval strategy therefore consists of obtaining a ranked output of documents in decreasing order of the query-document similarities, so that documents with higher similarity values will be retrieved first. However, for this purpose it is necessary to know both the correlation t_i · t_j between any pair of vectors in the basis and the components of the documents and queries along these vectors. It is convenient for subsequent discussions to express equation (5) in matrix notation as follows:

D = t A^T,   (8)

where A^T is the transpose of A, D = (d_1, d_2, ..., d_p), t = (t_1, t_2, ..., t_n), and

A = | a_11  a_12  ...  a_1n |
    | a_21  a_22  ...  a_2n |
    | ...   ...   ...  ...  |
    | a_p1  a_p2  ...  a_pn |.   (9)

Similarly, equation (7) can be rewritten as

S = q G A^T,   (10)

where S = (d_1 · q, d_2 · q, ..., d_p · q), q = (q_1, q_2, ..., q_n), and

G = | t_1 · t_1  ...  t_1 · t_n |
    | ...        ...  ...       |
    | t_n · t_1  ...  t_n · t_n |.   (11)

In the conventional vector space model, the term occurrence frequency matrix obtained empirically from automatic indexing is assumed to be the matrix A. Since the correlations between terms are not known a priori, as a first order of approximation the term vectors are assumed to be orthogonal to each other. Thus the correlation matrix G, defined in equation (11), becomes an identity matrix. That is,

g_ij = t_i · t_j = 1 if i = j, and 0 if i ≠ j.

With such approximations (i.e. G = I), the ranking matrix S for a given query q with respect to a set of document vectors can be computed easily from the following matrix equation,

S = q A^T.   (12)

2.3.2. Document and query representation in GVSM. From equation (5), it is seen that in the VSM the representation of a document is taken to be a vector sum of term vectors. In the GVSM, we introduce the circle sum operator (denoted by ⊕), and hypothesize that a document should be expressed as the circle sum of the associated term vectors. More precisely,

d_α = a_{α1} t_1 ⊕ a_{α2} t_2 ⊕ ... ⊕ a_{αn} t_n,   (13)

where the circle sum ⊕ is defined as follows. Let t_r and t_s be vectors having the form specified in equation (4). Then,

t_r ⊕ t_s = Σ_{m_k ∈ K} max(c_{rk}, c_{sk}) m_k,   (14)

where K = {m}^r ∪ {m}^s. This choice for the representation of d_α is intended to reflect the fact that the summing coefficients c_{ik} may not, in general, be binary. Let us consider, for example, the situation where the terms are not correlated. That is,

t_r · t_s = 0,

implying that there is no basis vector that appears in the expansions of both t_r and t_s, 1 ≤ r, s ≤ n.
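The ranking equations (10) and (12) can be sketched in a few lines of Python (the small matrices below are ours, chosen only to make the contrast visible; they are not from the paper):

```python
def matmul(X, Y):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

# Term occurrence matrix A (p = 3 documents, n = 2 terms), equation (9).
A = [[2.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

q = [[1.0, 0.5]]  # query components along the term vectors, equation (6)

# Orthogonal-term approximation G = I, so S = q A^T, equation (12).
AT = [list(col) for col in zip(*A)]
S_vsm = matmul(q, AT)

# GVSM keeps the term-correlation matrix G of equation (11): S = q G A^T.
G = [[1.0, 0.4],
     [0.4, 1.0]]
S_gvsm = matmul(matmul(q, G), AT)
```

With G = I the second document scores 0.5, but once the assumed correlation g_12 = 0.4 is admitted its score rises: correlated terms contribute to each other's matches, which is precisely what equation (10) adds over equation (12).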
Under this condition, the circle sum in equation (13) reduces to an ordinary vector sum, so equation (13) is a generalization of the hypothesis that each document vector may be expressed as a vector sum of the m_k s. A reader knowledgeable in the area of fuzzy set theory may immediately appreciate the motivation behind this definition by thinking of t_r and t_s as fuzzy sets and by imagining c_{rk} and c_{sk} as being the degrees of membership of m_k in, respectively, t_r and t_s [7, 10, 20, 21]. It is proposed that the coefficients of the m_k s in the document representation should be obtained by first prescribing an expansion for each term vector and then by using equation (13).

It is also necessary to specify the way in which a query will be represented. Given q = (q_1, q_2, ..., q_n), we propose that the query be represented as a vector sum of the t_i s involved. That is,

q = Σ_{j=1}^{n} q_j t_j.   (15)

This choice is made since, in the VSM, the user does not imply any structure in the query specification. As we shall see later, when it is necessary to incorporate more structure, queries will not be given by the above expression. Using these prescriptions, both documents and queries can now be expressed as a linear combination of the m_k s and the computation of d_α · q is straightforward. All that still remains is to show how equation (4) is generalized to express each t as a vector sum of m_k s. As mentioned earlier, this requires that the meaning of term correlations be made precise.

2.3.3. Computation of term correlations. First a simple example is presented to motivate the approach adopted.

Example 2.1

Let D be a set of documents indexed by either term t_1 or term t_2; that is, t_1 and t_2 constitute the indexing vocabulary. Let D_F be the maximal subset of documents satisfying F, where F is a Boolean expression in the t s, and let D̄_{t_1} denote the set complement of D_{t_1} (i.e. D̄_{t_1} is the maximal subset of documents not containing t_1). We can identify the following disjoint subsets forming a partition of D:

D_{t_1 t_2} = D_{t_1} ∩ D_{t_2},
D_{t_1 t̄_2} = D_{t_1} ∩ D̄_{t_2},
D_{t̄_1 t_2} = D̄_{t_1} ∩ D_{t_2},

where D_{t_1 t_2}, D_{t_1 t̄_2}, D_{t̄_1 t_2} correspond respectively to D_{t_1 AND t_2}, D_{t_1 AND t̄_2}, D_{t̄_1 AND t_2}; the ANDs are dropped for notational convenience. Based on intuition we argue that the correlation between any two index terms depends on the number of documents in which these two terms appear together. This sort of argument, based on term co-occurrence information, has also been the basis for measuring term correlations in earlier studies [11, 22]. Let c(D_F) denote the cardinality of the set D_F. The cardinality c(D_{t_1 t_2}) of the subset D_{t_1 t_2} = D_{t_1} ∩ D_{t_2} (which denotes the total number of documents containing t_1 and t_2) thus provides a plausible measure of the unnormalized correlation between t_1 and t_2. In terms of vector notation, the correlation between t_1 and t_2, denoted by t_1 · t_2, can be conveniently expressed as a scalar product of the two normalized term vectors t_1 and t_2, namely,

t_1 · t_2 = c²(D_{t_1 t_2}) / ([c²(D_{t_1 t_2}) + c²(D_{t_1 t̄_2})]^{1/2} [c²(D_{t_1 t_2}) + c²(D_{t̄_1 t_2})]^{1/2}),

where

t_1 = [c(D_{t_1 t_2}) m_1 + c(D_{t_1 t̄_2}) m_2] / [c²(D_{t_1 t_2}) + c²(D_{t_1 t̄_2})]^{1/2},

t_2 = [c(D_{t_1 t_2}) m_1 + c(D_{t̄_1 t_2}) m_3] / [c²(D_{t_1 t_2}) + c²(D_{t̄_1 t_2})]^{1/2},

and m_1, m_2 and m_3 are the orthonormal basic vectors. □

It is evident from the above example that terms can be meaningfully expressed as linear combinations of the m_k s. Note that m_1, m_2 and m_3 correspond respectively to the atomic expressions t_1 t_2, t_1 t̄_2 and t̄_1 t_2. In this example, only the presence or absence of a term in a document is considered. This limitation is reflected in the assertion that c(D_{t_1 t_2}) is a measure of the correlation between t_1 and t_2. Furthermore, the example helps in convincing oneself that the expansion of a term vector, say t_i, need not have a non-zero coefficient for all basic vectors corresponding to the minterms in {m}^{2^n}. This is due to the fact that term co-occurrences depend on the particular collection of documents at hand. For instance, if t_1 and t_2 do not co-occur in a given collection then neither of the expansions of t_1 or t_2 will involve m_1, and t_1 · t_2 will be zero. More generally, given terms t_1, t_2, ..., t_n and a collection of documents of cardinality p, the active minterms constitute only a subset of the minterms in {m}^{2^n}. Since, in the worst case, each document can correspond to a different minterm, the size of the basis set is at most p. Thus, the expansion of t_i, 1 ≤ i ≤ n, involves only the vectors associated with the set of active minterms. It is understood, in subsequent discussions, that {m}^i refers to just the subset of the active minterms in which the literal t_i appears in uncomplemented form.

Another issue raised in the discussion above is the limitation of using only the cardinalities. A natural generalization, which considers the importance of a term to the documents (i.e. term weights), should be developed. In [17], the following expression for the normalized t_i is proposed:

t_i = Σ_{m_k ∈ {m}^i} c'_{ik} m_k / [Σ_{m_k ∈ {m}^i} (c'_{ik})²]^{1/2},   (16)

where the unnormalized form of c_{ik} is given by

c'_{ik} = Σ_{d_α ∈ D_{m_k}} a_{αi},   (17)

with D_{m_k} denoting the set of documents whose dominant minterm is m_k.
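Equations (16) and (17) can be sketched in Python: each document's 0/1 occurrence pattern picks out an active minterm, the unnormalized coefficient c'_ik sums the weights a_αi of the documents falling on minterm m_k, and correlations fall out as scalar products of the normalized term vectors. The matrix below is the one used in Example 2.2; the variable names are ours.

```python
from math import sqrt

# Document-term weight matrix A from Example 2.2 (rows d1..d4, cols t1..t3).
A = [[2, 0, 1],
     [1, 0, 0],
     [0, 1, 3],
     [2, 0, 0]]

# Active minterms: each document's 0/1 occurrence pattern (dominant atom).
patterns = [tuple(1 if w > 0 else 0 for w in row) for row in A]
active = sorted(set(patterns), reverse=True)  # m1, m2, m3 of Example 2.2

def term_vector(i):
    """Equations (16)-(17): c'_ik sums a_ai over the documents whose
    dominant minterm is m_k; the vector is then normalized to unit length."""
    c = [sum(A[a][i] for a in range(len(A)) if patterns[a] == m)
         for m in active]
    norm = sqrt(sum(x * x for x in c))
    return [x / norm for x in c]

t1, t2, t3 = (term_vector(i) for i in range(3))
corr_13 = sum(x * y for x, y in zip(t1, t3))  # t1 . t3
```

Running this reproduces the expansions of Example 2.2 (t_1 ≈ 0.55 m_1 + 0.83 m_2, etc.), and corr_13 gives the non-zero correlation between t_1 and t_3 that the orthogonality assumption of the VSM would discard.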


2.3.4. A simple illustration. The above concepts are illustrated by the following example.

Example 2.2

Suppose we are given a set of documents D = {d_1, d_2, d_3, d_4} indexed using the indexing vocabulary T = {t_1, t_2, t_3}. The weights of each term in the documents are given by the following matrix:

          t_1  t_2  t_3
A =  d_1   2    0    1
     d_2   1    0    0
     d_3   0    1    3
     d_4   2    0    0

Without loss of generality, the 8 fundamental products, or minterms, can be represented as follows:

m_1 = t_1 t̄_2 t_3,   m_2 = t_1 t̄_2 t̄_3,
m_3 = t̄_1 t_2 t_3,   m_4 = t_1 t_2 t_3,
m_5 = t_1 t_2 t̄_3,   m_6 = t̄_1 t̄_2 t_3,
m_7 = t̄_1 t_2 t̄_3,   m_8 = t̄_1 t̄_2 t̄_3.

Each t_i ∈ T can be expressed in a disjunctive normal form:

t_1 = t_1 AND (t_2 OR t̄_2) AND (t_3 OR t̄_3)
    = [(t_1 AND t_2) OR (t_1 AND t̄_2)] AND (t_3 OR t̄_3)
    = (t_1 AND t_2 AND t_3) OR (t_1 AND t_2 AND t̄_3) OR (t_1 AND t̄_2 AND t_3) OR (t_1 AND t̄_2 AND t̄_3)
    = m_4 OR m_5 OR m_1 OR m_2.

By considering only the active minterms, we write {m}^1 = {m_1, m_2}. Similarly,

t_2 = t_2 AND (t_1 OR t̄_1) AND (t_3 OR t̄_3) = m_4 OR m_5 OR m_3 OR m_7,

and, therefore, {m}^2 = {m_3}, and

t_3 = t_3 AND (t_1 OR t̄_1) AND (t_2 OR t̄_2) = m_4 OR m_1 OR m_3 OR m_6,

and {m}^3 = {m_1, m_3}. From equations (16), (17) and the above document matrix A, we obtain

t_1 = [2 m_1 + (1 + 2) m_2 + 0 m_3] / √13 = 0.55 m_1 + 0.83 m_2,

t_2 = [0 m_1 + (0 + 0) m_2 + 1 m_3] / √1 = m_3,

t_3 = [1 m_1 + (0 + 0) m_2 + 3 m_3] / √10 = 0.32 m_1 + 0.95 m_3.

By substituting the above expressions for the term vectors into equation (13), we obtain:

d_1 = 2t_1 ⊕ t_3 = 2(0.55 m_1 + 0.83 m_2) ⊕ (0.32 m_1 + 0.95 m_3) = 1.10 m_1 + 1.66 m_2 + 0.95 m_3,

d_2 = t_1 = 0.55 m_1 + 0.83 m_2,

d_3 = t_2 ⊕ 3t_3 = m_3 ⊕ 3(0.32 m_1 + 0.95 m_3) = 0.96 m_1 + 2.85 m_3,

d_4 = 2t_1 = 2(0.55 m_1 + 0.83 m_2) = 1.10 m_1 + 1.66 m_2.

Similarly, we can transform, for example, the query vector q = t_1 + t_2 into a linear combination of atomic vectors,

q = (0.55 m_1 + 0.83 m_2) + (m_3) = 0.55 m_1 + 0.83 m_2 + m_3.

Then the cosine similarity s'_α = (d_α · q)/(|d_α||q|) between the document d_α and the query q can be computed as follows:

s'_1 = [(1.10)(0.55) + (1.66)(0.83) + (0.95)(1)] / [√(1.10² + 1.66² + 0.95²) √(0.55² + 0.83² + 1²)] = 0.9248,

s'_2 = [(0.55)(0.55) + (0.83)(0.83) + (0)(1)] / [√(0.55² + 0.83²) √(0.55² + 0.83² + 1²)] = 0.7056,

s'_3 = [(0.96)(0.55) + (0)(0.83) + (2.85)(1)] / [√(0.96² + 2.85²) √(0.55² + 0.83² + 1²)] = 0.7819,

s'_4 = [(1.10)(0.55) + (1.66)(0.83) + (0)(1)] / [√(1.10² + 1.66²) √(0.55² + 0.83² + 1²)] = 0.7056.

Based on these cosine similarity values (s'_1 > s'_3 > s'_2 ≥ s'_4), d_1 will be retrieved first, d_3 second, and so on. □
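The circle-sum construction of equation (13) — scale each term expansion by its document weight, then keep the larger coefficient on any shared basis vector — can be sketched with sparse vectors as dicts; the numbers below reproduce d_1 of Example 2.2:

```python
def circle_sum(u, v):
    """Equation (14): component-wise max over the union of the active
    basis vectors of u and v (a fuzzy-set-style union)."""
    return {k: max(u.get(k, 0.0), v.get(k, 0.0)) for k in u.keys() | v.keys()}

def scale(a, u):
    """Multiply every coefficient of the sparse vector u by the weight a."""
    return {k: a * x for k, x in u.items()}

# Term vectors of Example 2.2 over the active minterms m1, m2, m3.
t1 = {"m1": 0.55, "m2": 0.83}
t3 = {"m1": 0.32, "m3": 0.95}

# d1 = 2 t1 (circle-sum) t3, per equation (13).
d1 = circle_sum(scale(2, t1), t3)
```

On the shared minterm m_1 the larger coefficient (1.10, from 2 t_1) wins over 0.32 (from t_3), giving d_1 = 1.10 m_1 + 1.66 m_2 + 0.95 m_3 as in the example.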

3. PROCESSING BOOLEAN QUERIES IN THE GVSM

In the GVSM, a query is simply defined as a weighted vector sum of term vectors. Unfortunately, the choice of this form of query does not allow the user to explicitly specify query structure as can be done in Boolean systems. Clearly, it would be advantageous to provide a prescription to handle Boolean queries in the GVSM environment. In this section we develop such a prescription. In order to demonstrate that the proposed scheme is natural and sound, we first characterize the strict Boolean retrieval system within the framework of the generalized vector space model.


3.1. Strict Boolean retrieval in the GVSM environment

A standard approach to describing the Boolean retrieval environment is to characterize each document as a set of terms, and to formulate queries as Boolean expressions on the literals associated with the terms. Now, if a document d_α is defined as the set of terms {t_{α1}, t_{α2}, ..., t_{αk}}, then d_α is said to satisfy a given query provided that the Boolean expression of the query evaluates to 1 by substituting a value 1 for the variables t_{α1}, t_{α2}, ..., t_{αk} and a value 0 for the remaining variables. An alternative way to interpret this retrieval strategy is to regard each document as being represented by an atomic expression. That is, each document is represented by the minterm in which exactly those terms contained in the document are unnegated. In subsequent discussions, the minterm that is associated with a document in this manner is referred to as the dominant atom of the document. Then, a document is said to satisfy a query only if the dominant atom of the document appears as one of the minterms in the disjunctive normal form of the query expression.

In terms of the above characterization of documents, the embedding of the strict Boolean retrieval environment into the vector space environment is straightforward. Specifically, each document is mapped onto the particular basis vector m_k that corresponds to the dominant atom of the document. On the other hand, a query (a Boolean expression) is transformed into a vector sum of the basis vectors corresponding to those minterms that appear in its disjunctive normal form. The addition and multiplication operations over the scalars are defined to be the same as in Boolean algebra. In this framework, the retrieval operation involves the application of the scalar product between the various document vectors and a given query vector. The documents having a scalar product of 1 would be considered retrieved.

Note that the representation of a document by a single vector from the basis is a special case of that adopted in the GVSM. In the GVSM, a document is represented by a vector sum of basis vectors. That is,

d_α = Σ_k c_{αk} m_k.

For the strict Boolean environment, all the c_{αk}s but one are zero. That is, d_α is assumed to be modeled by its dominant atom. This discussion highlights the point that, in the most natural transformation from the Boolean retrieval environment to a vector model based on the developments in Section 2.2, the document vectors end up being treated as orthogonal to each other. In other words, our hypothesis in the GVSM, that documents should have a representation as a vector sum of m_k s, is a generalization that recognizes the fact that documents may not be considered to be unrelated to each other. Conversely, when we reduce the variation of the GVSM (to be developed in the following section) to the special case of the document vectors being orthogonal, we shall obtain the equivalent of the strict Boolean environment. Thus our analysis brings into focus what we perceive to be the crux of the Boolean model's weakness.
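The dominant-atom view of strict Boolean retrieval can be sketched as follows — a minimal Python illustration (names and the tiny collection are ours), assuming the query is already given in disjunctive normal form as a set of minterms:

```python
def dominant_atom(doc_terms, vocabulary):
    """Map a document (set of terms) to its dominant atom: the 0/1
    pattern recording which vocabulary terms it contains."""
    return tuple(1 if t in doc_terms else 0 for t in vocabulary)

vocab = ("t1", "t2", "t3")
docs = {"d1": {"t1", "t3"}, "d2": {"t1"}, "d3": {"t2", "t3"}}

# Query (t1 AND NOT t2), given by the minterms of its DNF over vocab:
# t1 t2' t3 and t1 t2' t3', written as 0/1 patterns.
query_minterms = {(1, 0, 1), (1, 0, 0)}

# A document satisfies the query iff its dominant atom is one of the
# query's minterms (equivalently, the 0/1 scalar product equals 1).
retrieved = {name for name, terms in docs.items()
             if dominant_atom(terms, vocab) in query_minterms}
```

Every document either matches with score 1 or not at all, which makes the orthogonality of the document vectors — and hence the lack of ranking — explicit.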

3.2. Extended Boolean environment under the GVSM

In this section we extend the ideas presented above to obtain a scheme for expressing weighted Boolean queries as vectors. As far as the documents are concerned, the representation is the same as that for the GVSM. That is,

d_α = Σ_{i=1}^{n} a_{αi} t_i.

Before we describe our method for transforming weighted Boolean queries to their vector counterparts, a query language consisting of weighted terms linked by AND, OR and NOT connectives is introduced. From a set of index terms, {t_i | i = 1, 2, ..., n}, a simple query is defined as an ordered pair

q_i = (t_i, w_i),

where w_i denotes the term weight. A general (mixed) query can be formed from a set of simple queries, {q_1, q_2, ...}, linked by the AND, OR and NOT connectives. For example, a complex query q can be written as

q = [NOT q_1 AND (q_2 OR q_3)] OR q_4 = [NOT (t_1, w_1) AND ((t_2, w_2) OR (t_3, w_3))] OR (t_4, w_4).

The next step is to define a suitable mapping which transforms such a query q to a vector in R^{2^n}. This process can be described in two stages. First, it is necessary to define operators ⊕, ⊙ and ¬ in a vector subspace of R^{2^n}, which correspond respectively to OR, AND and NOT. The motivation for this step is that we need operators that can combine vector queries, corresponding to the Boolean connectives that combine simple queries to form mixed Boolean queries. Let

M = ∪_{i=1}^{n} {m}^i,

where {m}^i denotes the set of active minterms pertaining to t_i. Note that M is the set of active minterms which span all the terms in {t}^n. In the previous section, which advances a way of generalizing a vector processing system, it has been shown how each term vector can be expressed as a linear combination of basis vectors in the vector subspace R^{2^n}. That is,

t_i = Σ_{m_k ∈ {m}^i} c_{ik} m_k,

where the c_{ik}s are normalized coefficients such that 0 < c_{ik} ≤ 1. Given such a representation of the term vectors, the needed vector operators are defined below.

(i) The unary operator ¬ on a vector t_i is defined as follows:

    ¬t_i = I − t_i = I − Σ_{m_k ∈ {m}^i} c_{ik} m_k
         = Σ_{m_k ∈ {m}^i} (1 − c_{ik}) m_k + Σ_{m_k ∈ M − {m}^i} m_k,    (18)

where

    I = Σ_{m_k ∈ M} m_k.

(ii) Given any two term vectors t_1 and t_2, given by

    t_1 = Σ_{m_k ∈ {m}^1} c_{1k} m_k,    t_2 = Σ_{m_k ∈ {m}^2} c_{2k} m_k,

the disjunction operator ⊕ is defined by

    t_1 ⊕ t_2 = Σ_{m_k ∈ {m}^1 ∪ {m}^2} max(c_{1k}, c_{2k}) m_k.    (19)

(iii) Similarly, the conjunction operator ⊗ is defined by

    t_1 ⊗ t_2 = Σ_{m_k ∈ {m}^1 ∩ {m}^2} max(c_{1k}, c_{2k}) m_k.    (20)
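As a rough sketch (ours, not code from the paper), the three operators above can be written down for vectors stored as Python dicts mapping a minterm identifier to its coefficient; absent minterms are treated as having coefficient zero:

```python
# Sketch of the GVSM operators (18)-(20); the dict-based vector
# representation and the function names are our own conventions.

def not_op(t, M):
    """NOT, eq. (18): complement of each coefficient over the full
    universe M of active minterms (absent minterms count as 0)."""
    return {m: 1.0 - t.get(m, 0.0) for m in M}

def or_op(t1, t2):
    """OR, eq. (19): max of coefficients over the union of the two
    sets of active minterms."""
    return {m: max(t1.get(m, 0.0), t2.get(m, 0.0))
            for m in set(t1) | set(t2)}

def and_op(t1, t2):
    """AND, eq. (20): max of coefficients, restricted to the minterms
    common to both vectors."""
    return {m: max(t1[m], t2[m]) for m in set(t1) & set(t2)}
```

Applied to the weighted vectors of Example 3.1 below (e.g. 0.2 t_1 and 0.3 t_3), these definitions reproduce the corresponding entries of Table 1.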

The definitions of ⊕, ⊗ and ¬ given above are straightforward. On the one hand, they are considered reasonable on the grounds that they are similar to definitions used in generalizing from set theory to fuzzy set theory [7, 10, 20, 21]. On the other hand, as we shall show later, the choice is justified by the quality of the results that we are able to obtain through their use. Specifically, two kinds of results are of interest vis-à-vis this kind of generalization: (i) the extent to which Boolean algebraic properties are preserved by the operators, and (ii) the evaluation of the degree of improvement in retrieval performance, through experimentation. Both lines of justification will be provided.

In earlier work, the operator analogous to the conjunction operator, ⊗, has been defined by min. We instead propose max; but, as usual, the specification that the summation in equation (20) be restricted to those basis vectors common to both t_1 and t_2 is still retained. Researchers criticizing the fuzzy-set model have argued that using the min function unduly subdues the importance of the common elements [12, 21]. We believe that the use of the max function as defined by equation (20) would alleviate this problem.

Having defined the operators ¬, ⊕ and ⊗, the next stage is to introduce a mapping, h : {q} → R^{2^n} (which is used to transform weighted Boolean query expressions into vectors), as follows:

    (a) h(t_i, w_i) = w_i t_i,
    (b) h[NOT (q)] = ¬h(q),
    (c) h[q_1 OR q_2] = h(q_1) ⊕ h(q_2),
    (d) h[q_1 AND q_2] = h(q_1) ⊗ h(q_2).    (21)

Based on the mapping h defined above, the extended query language satisfies the algebraic properties of idempotency, commutativity, duality, involution, associativity and distributivity (with one exception, which is brought out in Example 3.1). The proofs of these properties can be found in [23]. In the extended Boolean system under GVSM, the complement laws that hold for the strict Boolean expressions are not valid, that is,

    h[(t_i, w_1) AND [NOT (t_i, w_2)]] ≠ 0,
and
    h[(t_i, w_1) OR [NOT (t_i, w_2)]] ≠ I.

However, these two properties are not essential for the retrieval operations. Furthermore, the GVSM approach for the extended Boolean system can also be applied to a situation where different weights of importance must be assigned to phrases or query clauses. For example,

    q = [[(t_1, w_1) AND (t_2, w_2)], w_A] OR [[(t_3, w_3) AND (t_4, w_4)], w_B],

where w_A and w_B are the weights for the phrases with terms [t_1, t_2] and [t_3, t_4], respectively. This property may be particularly important to provide an environment where a Boolean expression can be formulated automatically from the initial natural language statements [14]. The transformation of an extended Boolean expression into a vector query can be illustrated with an example.

Example 3.1

Consider the following term vectors:

    t_1 = 0.4 m_1 + 0.2 m_2 + 0.1 m_4,
    t_2 = 0.2 m_2 + 0.3 m_3,
    t_3 = 0.1 m_1 + 0.3 m_4,

with {m}^1 = {m_1, m_2, m_4}, {m}^2 = {m_2, m_3} and {m}^3 = {m_1, m_4}. Note that M = {m_1, m_2, m_3, m_4}. Based on our formalism, the results of the transformation of Boolean queries into vector queries are summarized in Table 1. Some of the entries in the table are explained in detail below.

Consider, for example, a simple Boolean query, (t_3, 0.3). By the mapping h defined in equation (21), such a query can be transformed into the following vector:

    h[(t_3, 0.3)] = 0.3 t_3 = (0.3)(0.1 m_1 + 0.3 m_4) = 0.03 m_1 + 0.09 m_4.
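As a rough illustration (ours, not the paper's implementation), the mapping h can be applied recursively over a small query tree; the tuple-based query encoding and helper names below are our own assumptions:

```python
# Sketch of the mapping h of eq. (21) on a tiny query AST.
# A query is ('term', name, weight), ('not', q), ('or', q1, q2) or
# ('and', q1, q2); term vectors are dicts minterm -> coefficient.

def h(query, terms, M):
    op = query[0]
    if op == 'term':
        _, name, w = query
        # (a) h(t_i, w_i) = w_i t_i
        return {m: w * c for m, c in terms[name].items()}
    if op == 'not':
        v = h(query[1], terms, M)
        # (b) complement over the minterm universe M, as in eq. (18)
        return {m: 1.0 - v.get(m, 0.0) for m in M}
    v1, v2 = h(query[1], terms, M), h(query[2], terms, M)
    if op == 'or':
        # (c) max over the union of active minterms, eq. (19)
        return {m: max(v1.get(m, 0.0), v2.get(m, 0.0))
                for m in set(v1) | set(v2)}
    # (d) max over the common active minterms, eq. (20)
    return {m: max(v1[m], v2[m]) for m in set(v1) & set(v2)}

# Term vectors of Example 3.1
terms = {'t1': {'m1': 0.4, 'm2': 0.2, 'm4': 0.1},
         't2': {'m2': 0.2, 'm3': 0.3},
         't3': {'m1': 0.1, 'm4': 0.3}}
M = {'m1', 'm2', 'm3', 'm4'}
```

For instance, h(('and', ('term', 't1', 0.2), ('term', 't3', 0.3)), terms, M) yields coefficients of about 0.08 on m_1 and 0.09 on m_4, matching the corresponding entry of Table 1.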

Table 1. Examples of Boolean query transformation

                                                              Vector queries
    Boolean query expression                                  m_1     m_2     m_3     m_4
    (t_1, 0.2)                                                0.08    0.04            0.02
    (t_2, 0.5)                                                        0.1     0.15
    (t_3, 0.3)                                                0.03                    0.09
    NOT (t_3, 0.3)                                            0.97    1.0     1.0     0.91
    (t_1, 0.2) AND (t_2, 0.5)                                         0.1
    (t_1, 0.2) AND (t_3, 0.3)                                 0.08                    0.09
    (t_2, 0.5) OR (t_3, 0.3)                                  0.03    0.1     0.15    0.09
    (t_2, 0.5) AND (t_3, 0.3)
    ([(t_2, 0.5) OR (t_3, 0.3)], 0.8)                         0.02    0.08    0.12    0.07
    (t_1, 0.2) AND [(t_2, 0.5) OR (t_3, 0.3)]                 0.08    0.1             0.09
    [(t_1, 0.2) AND (t_2, 0.5)] OR [(t_1, 0.2) AND (t_3, 0.3)]  0.08  0.1             0.09
    (t_1, 0.2) OR [(t_2, 0.5) AND (t_3, 0.3)]                 0.08    0.04            0.02
    [(t_1, 0.2) OR (t_2, 0.5)] AND [(t_1, 0.2) OR (t_3, 0.3)]   0.08  0.1             0.09

Applying equation (18) for the NOT operator, we obtain

    h[NOT (t_3, 0.3)] = (1 − 0) m_2 + (1 − 0) m_3 + (1 − 0.03) m_1 + (1 − 0.09) m_4
                      = m_2 + m_3 + 0.97 m_1 + 0.91 m_4.

A more general query, for example, (t_1, 0.2) AND (t_3, 0.3), can be transformed into the vector:

    h[(t_1, 0.2) AND (t_3, 0.3)] = (0.08 m_1 + 0.04 m_2 + 0.02 m_4) ⊗ (0.03 m_1 + 0.09 m_4)
                                 = max(0.08, 0.03) m_1 + max(0.02, 0.09) m_4
                                 = 0.08 m_1 + 0.09 m_4.

Similarly, equation (19) can be used to evaluate an OR-query, (t_2, 0.5) OR (t_3, 0.3). We obtain

    h[(t_2, 0.5) OR (t_3, 0.3)] = (0.1 m_2 + 0.15 m_3) ⊕ (0.03 m_1 + 0.09 m_4)
                                = 0.03 m_1 + 0.1 m_2 + 0.15 m_3 + 0.09 m_4.

It can also be shown that the distributive rule of AND over OR holds. For example,

    h[(t_1, 0.2) AND [(t_2, 0.5) OR (t_3, 0.3)]]
        = h[[(t_1, 0.2) AND (t_2, 0.5)] OR [(t_1, 0.2) AND (t_3, 0.3)]].

However, the distributive rule does not hold for OR over AND, as shown below:

    h[(t_1, 0.2) OR [(t_2, 0.5) AND (t_3, 0.3)]]
        ≠ h[[(t_1, 0.2) OR (t_2, 0.5)] AND [(t_1, 0.2) OR (t_3, 0.3)]].

Thus, using the method mentioned in Example 2.2 of the previous section, any query vector obtained from the above transformation can be used to compute the query-document similarity values.

4. A SUMMARY OF GVSM AND ITS VARIATIONS

The ideas developed in the last two sections are summarized here. The basic premises of the GVSM model are characterized by how terms and documents are represented. That is,

    t_i = Σ_{m_k ∈ {m}^i} c_{ik} m_k,

    d_a = Σ_{i=1}^{n} a_{ai} t_i.

Two important variations are possible depending on whether the users need to have the option of structuring queries. The two cases are:

(i) A query is specified as a list of (index term, weight) pairs. In this case we have

    q = Σ_{j=1}^{n} q_j t_j.

(ii) A query is specified as a weighted Boolean expression. The query vector corresponding to such a Boolean query, q, is defined by the mapping h, namely, q = h(q).

The first formulation is the basic GVSM and the second one characterizes the unified GVSM. The latter makes it possible to process extended Boolean queries within our framework. An important aspect of the basic GVSM is that it generalizes VSM to incorporate term correlations. In fact, we have pointed out that, when terms are assumed to be orthogonal, the document representation reduces to the vector sum of terms. Similarly, the unified GVSM reduces to the strict Boolean retrieval model when each document is represented by its dominant atomic vector. Of course, identifying each document with a single atomic vector implies that the document vectors are orthogonal to each other.
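Once a query (from either variation) and a document are both expressed in the minterm basis, ranking reduces to an inner product, since the atomic vectors are orthonormal. A minimal sketch, assuming a plain inner product as the matching function (the paper's Example 2.2, not reproduced in this section, gives the precise prescription):

```python
# Sketch: query-document similarity in the minterm basis, with
# vectors as dicts mapping minterm id -> coefficient (our convention).

def sim(q, d):
    """Inner product of two minterm-space vectors."""
    return sum(w * d.get(m, 0.0) for m, w in q.items())

def rank(q, docs):
    """Return document ids sorted by decreasing similarity to q."""
    return sorted(docs, key=lambda name: sim(q, docs[name]), reverse=True)
```

For example, the transformed query 0.08 m_1 + 0.09 m_4 scores 0.08 against a document whose only active minterm among these is m_1 with coefficient 1.0.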

5. EVALUATION OF THE UNIFIED GVSM

In this section we provide an evaluation of the unified model. The evaluation is done in two stages. We first give a brief description of the p-norm model, which was proposed by Salton and his co-workers for extended Boolean retrieval, and contrast the models at the conceptual level. Secondly, we present experimental results which compare the retrieval effectiveness of these two models both for initial searching and for searching after relevance feedback is given.

5.1. The p-norm model and a comparison of general features

In the p-norm model [12, 13], the documents are represented by a set of vectors {d_a | a = 1, 2, ..., m} in a vector space spanned by the set of orthonormal term vectors {t_i | i = 1, 2, ..., n}. Each document vector can be defined as

    d_a = Σ_{i=1}^{n} a_{ai} t_i,    (22)

where 0 ≤ a_{ai} ≤ 1, and p is a rational number in the range from 1 to ∞. Let w_i be the weight of query term t_i, 0 ≤ w_i ≤ 1. Based on the p-norm model, the generalized Boolean OR-query, Q_OR^p, and AND-query, Q_AND^p, can be written as

    Q_OR^p = [(t_1, w_1) OR^p (t_2, w_2) OR^p ... OR^p (t_n, w_n)],
and
    Q_AND^p = [(t_1, w_1) AND^p (t_2, w_2) AND^p ... AND^p (t_n, w_n)].

The similarity measures between a given document d_a = (a_{a1}, a_{a2}, ..., a_{an}) and Q_OR^p, and between d_a and Q_AND^p, are defined respectively by

    sim(d_a, Q_OR^p) = [ (a_{a1}^p w_1^p + a_{a2}^p w_2^p + ... + a_{an}^p w_n^p) / (w_1^p + w_2^p + ... + w_n^p) ]^{1/p},    (23)

and

    sim(d_a, Q_AND^p) = 1 − [ ((1 − a_{a1})^p w_1^p + (1 − a_{a2})^p w_2^p + ... + (1 − a_{an})^p w_n^p) / (w_1^p + w_2^p + ... + w_n^p) ]^{1/p}.    (24)

The above expressions handle queries that are in standard form. For mixed queries with both AND and OR operators, the formulas (23) and (24) can be invoked separately for each clause, which is of the standard form. The evaluation of d_a with respect to such clauses can then be combined by recursive application of the same computations as those employed at the level of clauses, until the query is completely processed.

In the p-norm model, the interpretation of the query can be altered by using different p-values to compute the query-document similarity. When p = 1, the similarity measures for the OR-query and AND-query are the same. In other words, the distinction between the AND and OR connectives in a query disappears, and the conventional vector processing model (where the similarity is based on the standard inner product function) is obtained. In contrast, when the query terms are all equally weighted and p = ∞, expressions (23) and (24) can be written as

    sim(d_a, Q_OR^∞) = lim_{p→∞} [ (a_{a1}^p w^p + a_{a2}^p w^p + ... + a_{an}^p w^p) / (w^p + w^p + ... + w^p) ]^{1/p}
                     = max[a_{a1}, a_{a2}, ..., a_{an}],    (25)

and

    sim(d_a, Q_AND^∞) = lim_{p→∞} { 1 − [ ((1 − a_{a1})^p w^p + ... + (1 − a_{an})^p w^p) / (w^p + ... + w^p) ]^{1/p} }
                      = min[a_{a1}, a_{a2}, ..., a_{an}].    (26)

Equations (25) and (26) indicate that when p = ∞ and the query is unweighted, the value of the query-document matching function is given by the highest weight among document terms for an OR-query and the lowest weight among document terms for an AND-query. This is precisely the way in which weights are handled in the fuzzy-set model. It is easy to also see that, under this special case, when document terms are also unweighted, these expressions reduce to the matching functions of the strict Boolean model.

In comparing the unified GVSM to the p-norm model, the following points are noteworthy:

(i) Both models can handle weighted Boolean queries (i.e. structure perceived by the user can be incorporated into the query by using Boolean connectives).

(ii) Both reduce to VSM and strict Boolean retrieval models, respectively, under certain conditions.

(iii) The p-norm model involves the parameter p, which has to be experimentally determined. No such parameters are involved in the unified GVSM. Moreover, the range of p-values is not as broad as it may seem, since there is very little difference in performance between the cases when p is as small as 3 and when p is much larger. More specifically, when p = 1 or p = 1.5, the performance is very good; but the performance drops quite drastically, even for small increases in p-value, to the point that making p = 3 tends to give results that are nearly as bad as setting p to ∞.

(iv) The proposed extended query language preserves more Boolean algebraic properties under the unified GVSM (e.g. associativity is violated in the p-norm model).

(v) In the unified GVSM, there is not a smooth transition from the extended Boolean, at one extreme, to the conventional vector processing, at the other extreme. In fact, we feel that the unified GVSM may be closer to the strict Boolean model than it is to VSM. In other words, the role of the Boolean operators is rather strictly retained. The p-norm model, in contrast, achieves more softening of the operators.

(vi) The p-norm model has the problem that it ignores term correlations. The notion of the L_p-norm is applied to obtain the similarity functions sim(d_a, Q_OR^p) and sim(d_a, Q_AND^p) assuming that documents can be represented in an n-dimensional vector space spanned by a set of orthonormal basic term vectors. For example, the length of any d_a can be expressed as

    ||d_a|| = (Σ_i Σ_j a_{ai} a_{aj} t_i · t_j)^{1/2}.

Now if we assume t_i · t_j = 0 for i ≠ j, we immediately obtain

    ||d_a|| = (a_{a1}^2 + a_{a2}^2 + ... + a_{an}^2)^{1/2} = ||d_a||_2.

That is, by representing the vector length using the L_2-norm, it is implicitly assumed that terms are not correlated (or, that all the terms are pairwise orthogonal). Since term vectors are not in general orthogonal to each other, it is questionable whether the notion of the generalized scalar product for the similarity function can still be used. The unified GVSM, in contrast, precisely specifies what is meant by term correlation and how it is to be incorporated into the retrieval scheme.

5.2. Experimental evaluation

5.2.1. General specification. In the experiments performed, the following collections are used: MEDLARS and CISI. Among the collections that were provided through the SMART system involving extended Boolean queries, only these two collections were small enough to be handled within the limits of the current implementation and our computer resources. MEDLARS is a collection of 1033 documents in the

field of biomedicine. There are 30 queries designed to be used in the Boolean processing environment. The MEDLARS collection was prepared by the National Library of Medicine. The CISI collection consists of 1460 documents on library science, and has 35 Boolean queries. The CISI collection was obtained from Cornell University, where the document titles and abstracts, for highly cited articles identified by the Institute for Scientific Information, were key-entered [24]. The collections include, for evaluation purposes, information as to which documents are relevant to each query.

The standard recall and precision measures are used for comparing the performance of different information retrieval models. Recall is defined as the proportion of relevant documents retrieved out of the total number of relevant documents in the whole collection, and precision is the proportion of relevant documents retrieved out of the total number retrieved. The overall performance of a retrieval strategy is determined by computing the average precision for recall values 0.1, 0.2, ..., and 1.0. The algorithm for averaging is consistent with that implemented in the SMART system.

5.2.2. Experiments with extended Boolean queries. For comparison purposes, performance results for strict Boolean, fuzzy-set, and p-norm systems were obtained from Cornell University [12, 14]. Binary weights for both documents and queries are used in the strict Boolean, p-norm, and GVSM models, while weighted documents are used in the fuzzy-set model runs. In particular, the results referred to as fuzzy-set correspond to the use of the p-norm model with p = ∞. Although the unified GVSM and the p-norm models can be tested under the condition that documents have weighted terms, we feel that it is not necessary. In other words, by providing a comparison using binary documents, we are better able to isolate the impact, on retrieval performance, of how the extended Boolean query is handled.
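As a concrete reference for the p-norm matching functions (23) and (24) of Section 5.1, here is a short sketch; the function names and list-based encoding are our own assumptions:

```python
# Sketch of the p-norm similarity functions (23) and (24).
# a: document-term weights a_ai; w: query-term weights w_i (same length).

def sim_or_p(a, w, p):
    """OR-query similarity, eq. (23)."""
    num = sum((ai ** p) * (wi ** p) for ai, wi in zip(a, w))
    den = sum(wi ** p for wi in w)
    return (num / den) ** (1.0 / p)

def sim_and_p(a, w, p):
    """AND-query similarity, eq. (24)."""
    num = sum(((1.0 - ai) ** p) * (wi ** p) for ai, wi in zip(a, w))
    den = sum(wi ** p for wi in w)
    return 1.0 - (num / den) ** (1.0 / p)
```

At p = 1 the two functions coincide (the vector processing case); as p grows with equal query weights, they approach the max and min of the document-term weights, as in equations (25) and (26).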
In a strict Boolean system, where the set of relevant and nonrelevant retrieved items is not ranked by the system, a ranked output can be simulated by defining an order for the retrieved items in which relevant items are randomly placed among the set of nonrelevant ones. Such a simulated ordering operation permits recall and precision computation for the base method. For the p-norm model, the results corresponding to p = 2 are reported because the AND and the OR operators are treated differently in this case.

Tables 2 and 3 summarize the results, respectively, for the MEDLARS and CISI collections. The columns labeled strict Boolean and fuzzy-set are from Salton and Voorhees [14]. The results of p-norm (p = 2, binary weights) were obtained from Cornell by private correspondence.

Table 2. Performance of different Boolean systems: MEDLARS (1033 docs, 30 queries)

                                          Precision
    Recall    Strict Boolean    Fuzzy-set    p-norm, p = 2    Unified GVSM
    0.1       0.5528            0.6339       0.7532           0.6907
    0.2       0.4313            0.4609       0.6875           0.6028
    0.3       0.3065            0.3398       0.6267           0.5274
    0.4       0.2370            0.2525       0.5486           0.4972
    0.5       0.1630            0.1866       0.4493           0.4638
    0.6       0.1532            0.1779       0.4323           0.3828
    0.7       0.1065            0.1259       0.3655           0.3362
    0.8       0.0769            0.0926       0.3059           0.2893
    0.9       0.0381            0.0525       0.1610           0.2301
    1.0       0.0321            0.0448       0.0866           0.1518
    Average precision
              0.2097            0.2373       0.4417           0.4172
    Average % improvement over strict Boolean system
                                +13%         +111%            +99%

The improvements of the fuzzy-set, p-norm and unified GVSM approaches over strict Boolean for the MEDLARS collection are, respectively, 13, 111 and 99%. When equation (21) is used to transform Boolean queries, under some special conditions, a query may be converted to a null vector. This happened for Q6 and Q10 in this collection. Therefore, the unified GVSM results are averaged over only 28 queries. The performance improvements for the above methods in the case of the CISI collection are 14, 62 and 42%, respectively. From these results, it is seen that the unified GVSM provides substantial improvements over the fuzzy-set and strict Boolean approaches. However, these improvements are not as high as those obtained from the p-norm model.

In order to provide a proper perspective on the results reported above, we next provide a comparison of the unified GVSM relative to the performance of the p-norm model for a range of p-values. These results are summarized in Tables 4 and 5. For comparison purposes a single precision value is used, which represents the average of precision values at three typical recall levels. The recall levels considered are a low recall of 0.25, a medium recall of 0.5, and a high recall of 0.75. In cases where precision values are known only for values 0.1, 0.2, ..., 1.0, the values for 0.25 and 0.75 are computed by averaging, respectively, precision at (0.2, 0.3) and precision at (0.7, 0.8).

In Section 1 it was mentioned that the Boolean operators are applied rather strictly under the unified GVSM. The results in Tables 2 and 3 are consistent with that observation. That is, the unified GVSM

provides precision values that seem comparable to the situation where a p-value of around 9 is used, for both collections. The result obtained here may, however, be preferable from the technical standpoint that most of the algebraic properties are preserved in the unified GVSM. In contrast, under the p-norm model, more types of queries that would appear to logicians to be equivalent will lead to different retrieval results. Thus, the tradeoff one has to consider is between maintaining desirable algebraic properties (and, hence, alleviating confusion during query formulation) vs permitting considerable softening of the operators. It is not difficult to convince oneself that greater softening of the operators would lead to more Boolean algebraic properties being destroyed. Such a conclusion can also be reached by an analysis of the approach in Paice [25]. Furthermore, the use of the parameter p, while providing certain flexibility, brings with it the problem of having to determine the best value of p for each situation. This problem can be circumvented, however, if it is determined that a particular p-value, say p = 2, is good enough for all collections.

Although the unified GVSM does not have a parameter such as p in the p-norm model, the two extreme cases have an analogue. The case p = 1 corresponds to the basic GVSM and the case p = ∞ corresponds to the special case of the unified GVSM

Table 3. Performance of different Boolean systems: CISI (1460 docs, 35 queries)

                                          Precision
    Recall    Strict Boolean    Fuzzy-set    p-norm, p = 2    Unified GVSM
    0.1       0.2259            0.2916       0.3559           0.2950
    0.2       0.1951            0.2397       0.2718           0.2324
    0.3       0.1405            0.1592       0.2112           0.2052
    0.4       0.1202            0.1312       0.1869           0.1668
    0.5       0.0982            0.1048       0.1671           0.1503
    0.6       0.0800            0.0808       0.1421           0.1305
    0.7       0.0544            0.0547       0.1229           0.1054
    0.8       0.0482            0.0478       0.0975           0.0862
    0.9       0.0405            0.0399       0.0782           0.0625
    1.0       0.0361            0.0354       0.0542           0.0408
    Average precision
              0.1040            0.1185       0.1688           0.1475
    Average % improvement over strict Boolean system
                                +14%         +62%             +42%

Table 4. Comparison of unified GVSM with p-norm model: MEDLARS (1033 docs, 30 queries)

    Type of run               p-value   Average precision at     % difference from   % difference
                                        three recall points      strict Boolean      from cosine
    Strict Boolean              -       0.2079†                    -                 -55%
    p-norm, binary weights      1       0.4710‡                  +126%               +2%
    p-norm, binary weights      2       0.4767‡                  +129%               +3%
    p-norm, binary weights      5       0.4720‡                  +127%               +2%
    p-norm, binary weights      9       0.4599‡                  +121%               0%
    Unified GVSM                -       0.4473                   +115%               -3%
    p-norm, weighted docs       ∞       0.2321†                  +12%                -50%
    VSM (cosine)                -       0.4605                   +121%               -

    †Source [14]; ‡Source [12].

in which the documents are assumed to be orthogonal (see Section 3.1) to each other. In general, we hypothesize that documents and queries should be represented by a vector sum of m's. The choice made for the representation of documents can be varied to involve several m's (according to some criterion) or just the dominant atom as in the strict Boolean model. This perspective suggests the possibility that document representation can be varied by having other prescriptions for the choice of m's than that based on the arguments in subsection 2.3.3. We are currently looking at some experiments along these lines.

It is shown in Wong et al. [16] and Wong [17] that the basic GVSM obtains significantly better results than VSM for almost all collections. Table 6 shows the comparative performance of the basic GVSM and the conventional VSM for the two collections considered here. It is seen that, both in MEDLARS and CISI, the basic GVSM gives better results. In the light of the fact that the basic GVSM yields extremely good performance and that the p-norm model does not really take term correlations into account, we believe that it may be attractive to develop models that provide the benefits of both by incorporating term correlations and by employing the parameter p to fine-tune performance according to the situation at hand.

5.2.3. Experiments performing relevance feedback with extended Boolean queries. Most modern on-line document retrieval systems enable the use of information derived from an initially retrieved set of documents to improve the query with which a subsequent search can be performed. A particular realization of this process is called relevance feedback. Relevance feedback methods were first used by Rocchio as part of a retrieval strategy for the SMART project [5, 6]. This scheme was applied to the conventional vector processing system in which a query q or a document d_a is represented by a linear combination of (pairwise orthogonal) term vectors.
Rocchio showed that an approximation to an optimal query vector may be generated in two steps. In the first step, the terms extracted from previously retrieved documents that are identified as relevant to the query are added to the initial query. The second step is to subtract the terms extracted from previously retrieved, nonrelevant documents from the initial query. This relevance feedback process proposed by Rocchio can be directly applied to our GVSM to enhance the performance of either the basic GVSM or the unified GVSM strategies. The following procedure outlines the main steps involved in the context of the unified GVSM:

(1) The steps outlined in Section 2.2 are used to transform the Boolean expression into a query vector q_0. The q_0 should be expressed in terms of the atomic vectors, or the m's.

(2) The query-document similarities are computed.

(3) Let D' denote the set of documents retrieved. From the set of retrieved documents D', identify the set D1 of n_1 relevant documents, and the set D2 of n_2 nonrelevant documents.

(4) Use vector addition to compute the average vector r for the relevant documents and the average vector s for the nonrelevant documents, namely,

    r = (1/n_1) Σ_{d_a ∈ D1} d_a,    s = (1/n_2) Σ_{d_a ∈ D2} d_a.

(5) Construct a new query q_1 from the initial query q_0 by adding the average vector of the relevant documents and subtracting the average vector of the nonrelevant documents as follows:

    q_1 = q_0 + r − s.    (27)

The modified query q_1 is expected to exhibit greater similarity with the relevant items and smaller similarity with the nonrelevant ones than the original q_0. In general, the form of equation (27) involves two parameters as follows:

    q_1 = q_0 + γ Σ_{d_a ∈ D1} d_a − δ Σ_{d_a ∈ D2} d_a.

It has been shown in Yu et al. [33], through rigorous analysis, that setting γ = 1/n_1 and δ = 1/n_2 is one of the highly promising choices. Our experiments reflect that recommendation.

In order to implement the above idea appropriately, a partial rank-freezing procedure is performed

Table 5. Comparison of unified GVSM with p-norm model: CISI (1460 docs, 35 queries)

    Type of run               p-value   Average precision at     % difference from   % difference
                                        three recall points      strict Boolean      from cosine
    Strict Boolean              -       0.1059†                    -                 -9%
    p-norm, binary weights      1       0.1687‡                  +59%                +46%
    p-norm, binary weights      2       0.1692‡                  +60%                +46%
    p-norm, binary weights      5       0.1691‡                  +60%                +46%
    p-norm, binary weights      9       0.1586‡                  +50%                +37%
    Unified GVSM                -       0.1549                   +46%                +34%
    p-norm, weighted docs       ∞       0.1185†                  +12%                +2%
    VSM (cosine)                -       0.1158                   +9%                 -

    †Source [14]; ‡Source [12].

in which the retrieval ranks of any relevant documents identified in the earlier search are frozen in the subsequent searches performed for the same query. At the same time, the nonrelevant documents retrieved in an earlier search are simply deleted from the collection and not further considered in the feedback process. This procedure ensures that any relevant document previously retrieved and used for the construction of feedback queries is counted fairly for evaluation purposes, in the original as well as in any subsequent searches [27]. The relevance feedback experiments consist of the following steps: (a) Perform an initial retrieval (run A) using the original query. (b) Identify the relevant and nonrelevant items retrieved in the top n ranks (e.g. n = 10 for the experiments in Table 7); construct a first iteration feedback query. (c) Perform retrieval (run B) using the first iteration feedback query; rank the retrieved documents in decreasing order of query-document similarity; at the same time adjust the ranks of all documents in the output obtained for run A according to the rank-freezing scheme explained above. (This result is referred to as continued run A.) (d) Compare the output of retrieval run B with the output of the cont. run A.
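The feedback step (27), with the setting γ = 1/n_1 and δ = 1/n_2 used in our experiments, can be sketched as follows; the dict-based vector representation and function name are our own assumptions:

```python
# Sketch of query modification via eq. (27): q1 = q0 + r - s, where
# r and s are the average vectors of the relevant (D1) and
# nonrelevant (D2) retrieved documents.

def rocchio(q0, relevant, nonrelevant):
    n1, n2 = len(relevant), len(nonrelevant)
    keys = set(q0)
    for d in relevant + nonrelevant:
        keys |= set(d)
    q1 = {}
    for k in keys:
        r = sum(d.get(k, 0.0) for d in relevant) / n1 if n1 else 0.0
        s = sum(d.get(k, 0.0) for d in nonrelevant) / n2 if n2 else 0.0
        q1[k] = q0.get(k, 0.0) + r - s
    return q1
```

Coefficients shared with nonrelevant documents are pushed down while those of relevant documents are reinforced, which is the intended effect of the modified query.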


The outputs of run B and cont. run A are precisely comparable because the same documents are missing from both runs (i.e. the nonrelevant documents appearing in the top n ranks for the original run A), and the relevant documents already seen are assigned the same ranks.

The results in Table 7 show that, for the MEDLARS collection, the feedback scheme described above for reformulating a Boolean query gives significant improvement. On the other hand, there is little improvement for the CISI collection. In the latter case, the reason must be that there are not very many relevant documents (as indicated by the small precision values at various recall levels) in the top 10 ranks. Consequently, we repeated the CISI experiments using a retrieved set of 25 documents (n = 25) and, as seen from Table 8, significant improvement is achieved.

Several attempts have been made in the past to develop methods to perform relevance feedback for (extended) Boolean queries [8, 17, 20, 27, 30, 31]. The methods of both Dillon et al. [28, 29] and Salton et al. [23, 27, 30, 31] consist of two main processing steps. The first step is the construction of "good" term clauses. A clause is defined as a single term or a conjunction of several terms (i.e. pairs or triplets) connected by AND operators. The second step involves the generation of a Boolean query statement

Table 6. Conventional VSM (cosine) vs basic GVSM (documents and queries have non-binary weights)

              MEDLARS (1033 docs, 30 queries)    CISI (1460 docs, 35 queries)
                       Precision                          Precision
    Recall    VSM        Basic GVSM               VSM        Basic GVSM
    0.1       0.7824     0.8280                   0.2489     0.2704
    0.2       0.6931     0.7685                   0.1806     0.2013
    0.3       0.5879     0.6931                   0.1554     0.1612
    0.4       0.5450     0.6358                   0.1187     0.1284
    0.5       0.4409     0.5907                   0.1050     0.1143
    0.6       0.3821     0.5263                   0.0927     0.0955
    0.7       0.3296     0.4469                   0.0791     0.0825
    0.8       0.2706     0.3866                   0.0696     0.0646
    0.9       0.1547     0.2841                   0.0588     0.0560
    1.0       0.0832     0.1549                   0.0381     0.0396
    Average precision
              0.4269     0.5315                   0.1147     0.1214
    Average % improvement over strict Boolean
              +103%      +153%                    +10.3%     +16.7%

Table 7. Relevance feedback results (run A is the same as unified GVSM; the cutoff for feedback is 10)

              MEDLARS (1033 docs, 28 queries)          CISI (1460 docs, 35 queries)
                       Precision                                Precision
    Recall    Run A     Cont. A    Run B               Run A     Cont. A    Run B
    0.1       0.6907    0.7838     0.8603              0.2950    0.3620     0.3710
    0.2       0.6028    0.7316     0.8164              0.2324    0.2837     0.3163
    0.3       0.5274    0.6486     0.7847              0.2052    0.2340     0.2309
    0.4       0.4972    0.5997     0.7491              0.1668    0.1783     0.1820
    0.5       0.4638    0.5392     0.6502              0.1503    0.1576     0.1478
    0.6       0.3828    0.4352     0.5565              0.1305    0.1353     0.1221
    0.7       0.3362    0.3742     0.4808              0.1054    0.1077     0.1009
    0.8       0.2893    0.3154     0.3814              0.0862    0.0876     0.0828
    0.9       0.2301    0.2496     0.2718              0.0625    0.0630     0.0634
    1.0       0.1518    0.1570     0.1533              0.0408    0.0411     0.0393
    Average precision
              0.4172    0.4838     0.5705              0.1475    0.1650     0.1656
    Average % improvement of Run B over Cont. A
                                   +18%                                     +0.4%

in disjunctive normal form, incorporating some of the clauses, in such a way that the clauses used are of good quality. More precisely, the final query would be of the following form:

    (S_1 OR S_2 OR ... OR S_s)
        OR [(P_1 AND P_2) OR (P_3 AND P_4) OR ... OR (P_{p−1} AND P_p)]
        OR [(T_1 AND T_2 AND T_3) OR ... OR (T_{t−2} AND T_{t−1} AND T_t)],    (28)

where the S_i's, P_i's and T_i's are respectively single terms, paired terms and tripled terms. The method by Dillon et al. uses a measure called "prevalence" to decide the goodness of terms. In the so-called DNF method [27, 30, 31], the method adopted for determining the goodness of single terms, pairs, etc. is related to the notion of relevance weight introduced in the context of probabilistic retrieval models [9, 32, 33]. In addition to that, the details of how a feedback query of the form in expression (28) is generated differ considerably between these two approaches. In any case, without going into those details, we can conclude from the experiments in Salton et al. [31] that the DNF method is able to generate better queries, whereas Dillon's approach typically leads to a decrease in retrieval quality.

The DNF method is further investigated in Salton et al. [27]. Several different contexts are considered, depending on whether the initial search is strict Boolean or p-norm and, when it is strict Boolean, how the set of documents to be used for feedback purposes is selected. The experiment of particular interest, vis-à-vis this paper, is the one where the p-norm model is used for the initial search and the feedback search. In this experiment, when the results from the first iteration feedback are compared to the original query results continued once (as in Cont. A described above), there is an improvement of 14 and 2%, respectively, for the MEDLARS and CISI collections. It is easily seen that the improvements obtained by the relevance feedback scheme under the unified GVSM are entirely comparable to their results.
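Purely as an illustration of the query shape in expression (28), and not of the clause-selection ("goodness") step of the DNF method, a feedback query string can be assembled from lists of chosen singles, pairs and triples; all names here are our own:

```python
# Sketch: build a DNF feedback query of the form (28) from already
# selected clauses. Selection/scoring of clauses is not shown.

def dnf_query(singles, pairs, triples):
    parts = []
    if singles:
        parts.append('(' + ' OR '.join(singles) + ')')
    if pairs:
        parts.append('(' + ' OR '.join(
            '({} AND {})'.format(a, b) for a, b in pairs) + ')')
    if triples:
        parts.append('(' + ' OR '.join(
            '({} AND {} AND {})'.format(a, b, c)
            for a, b, c in triples) + ')')
    return ' OR '.join(parts)
```

For example, with two single terms and one pair, the result is a disjunction of a singles clause and a parenthesized pair clause.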

A more important aspect of the proposed scheme, vis-à-vis earlier methods, is that the weights for single terms, pairs, etc. are "implicitly" considered through the assignment of weights to atomic expressions. Furthermore, the explicit computation of the weights of atomic expressions is carried out just once. In contrast, in the DNF method the pairs, triplets, etc. have to be selected after each iteration and their quality assessed from scratch. It is also worth pointing out that the generation of the feedback query, by adding and removing clauses, would be computationally much slower than that based on equation (27).

Table 8. Relevance feedback results for CISI (run A is the same as unified GVSM; the cutoff for feedback is 25)

              CISI (1460 docs, 35 queries)
                       Precision
    Recall    Run A     Cont. A    Run B
    0.1       0.2950    0.3760     0.4449
    0.2       0.2324    0.2936     0.3703
    0.3       0.2052    0.2498     0.3079
    0.4       0.1668    0.2030     0.2483
    0.5       0.1503    0.1779     0.2054
    0.6       0.1305    0.1467     0.1599
    0.7       0.1054    0.1126     0.1270
    0.8       0.0862    0.0903     0.0996
    0.9       0.0625    0.0644     0.0660
    1.0       0.0408    0.0415     0.0413
    Average precision
              0.1475    0.1756     0.2071
    Average % improvement of Run B over Cont. A: +17.94%

6. CONCLUSIONS

In the context of the conventional vector space model [1], there has been no formal method either for determining term correlations or for the incorporation of such correlations into the retrieval strategy. A model termed the GVSM (Generalized Vector Space Model) was introduced to fill the gap [17, 18]. Extensive experiments have been performed on the basic GVSM and the scheme has been found to be

highly successful [17]. The basic GVSM, however, assumes that the query does not involve any structural specification. That is, queries are specified as a list of (term, weight) pairs. In this paper, a prescription for processing extended Boolean queries under the premises of GVSM is advanced. This extended model, termed the unified GVSM is compared with the p-norm model both conceptually and experimentally. The advantages and disadvantages of each are identified. Both these models are compared to the retrieval effectiveness of the strict Boolean retrieval and are found to perform significantly better. Our experience with the unified GVSM model demonstrates that the major factor contributing to this gain is the choice of document representation; that is, the view of documents as consisting of several atomic expressions, rather than only the dominant atom.

A scheme by means of which user queries specified as Boolean expressions can be reformulated through relevance feedback is also proposed. This scheme is compared with other competing proposals and is found to be attractive.
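The proposed feedback scheme keeps the Boolean structure of the query fixed and re-estimates only the weights of the atomic expressions. Equation (27), given earlier in the paper, defines the actual update; the sketch below is a generic Rocchio-style illustration of such reweighting, in which the function name, the dictionary representation and the alpha, beta, gamma constants are our own choices rather than the paper's.

```python
def update_atom_weights(weights, rel_docs, nonrel_docs,
                        alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style reweighting of atomic expressions.

    weights:   current weight of each atomic expression in the query
    rel_docs:  document vectors (dicts over the same atoms) judged relevant
    nonrel_docs: document vectors judged non-relevant
    Returns a new weight dict; weights are clipped at zero.
    """
    new = {}
    for atom, w in weights.items():
        # Mean weight of the atom in the relevant and non-relevant sets
        pos = sum(d.get(atom, 0.0) for d in rel_docs) / max(len(rel_docs), 1)
        neg = sum(d.get(atom, 0.0) for d in nonrel_docs) / max(len(nonrel_docs), 1)
        new[atom] = max(alpha * w + beta * pos - gamma * neg, 0.0)
    return new
```

Because only the numeric weights change, the reformulated query can be evaluated by the same matching function as the original one.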

An important consequence of this work and other related research investigations is the demonstration of the feasibility of developing retrieval systems that are based on a hybrid model of the retrieval process. In particular, our work on the unified GVSM, as well as that of Salton et al. on the p-norm model, indicates that existing Boolean systems can be enhanced, without undue additional cost, to benefit from the flexibility of vector approaches. It is now up to the practitioners to give these promising alternatives a chance to be evaluated in a field study. Another important conclusion is that it appears likely that an even more comprehensive model, combining the unified GVSM and the p-norm models, can be formulated. Such a model should be able to incorporate the positive aspects of both. Further investigations in this direction are warranted.
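For reference, the p-norm operators referred to above combine the document weights of the operands of an OR- or AND-clause as follows. This is a standard sketch of the formulas of Salton, Fox and Wu [12], shown here with equal clause weights; it is not reproduced from this paper.

```python
def p_or(weights, p=2.0):
    # p-norm OR: high if any operand weight is high;
    # as p grows this approaches max(weights)
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)

def p_and(weights, p=2.0):
    # p-norm AND: high only if all operand weights are high;
    # as p grows this approaches min(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights)
                  / len(weights)) ** (1.0 / p)
```

At p = 1 both operators reduce to the arithmetic mean of the operand weights (pure vector processing); as p increases they tend toward the strict Boolean max and min, which is why the p-norm model interpolates between the two retrieval styles.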

REFERENCES

[1] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983).
[2] D. H. Kraft. A decision theory view of the information retrieval situation: an operations research approach. J. Am. Soc. Inform. Sci. 24, 368-376 (1973).
[3] S. E. Robertson. The probability ranking principle in IR. J. Docum. 33, 294-304 (1977).
[4] F. W. Lancaster and E. G. Fayen. Information Retrieval On-Line. Melville, Los Angeles (1973).
[5] J. J. Rocchio Jr. Relevance feedback in information retrieval. In The SMART Retrieval System—Experiments in Automatic Document Processing (Edited by G. Salton), Chap. 14. Prentice-Hall, Englewood Cliffs, N.J. (1971).
[6] G. Salton. The SMART Retrieval System—Experiments in Automatic Document Processing, Chaps 15, 16 and 18. Prentice-Hall, Englewood Cliffs, N.J. (1971).
[7] A. Bookstein. A comparison of two systems of weighted Boolean retrieval. J. Am. Soc. Inform. Sci. 31(4), 275-279 (1981).
[8] J. T. Rickman. Design considerations for a Boolean search system with automatic relevance feedback processing. Proc. National Meeting of the ACM, New York, pp. 478-481 (1971).
[9] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. J. Am. Soc. Inform. Sci. 27(3), 129-146 (1976).
[10] V. Tahani. A fuzzy model of document retrieval systems. Inform. Process. Mgmt. 12, 177-187 (1976).
[11] C. J. van Rijsbergen. A theoretical basis for the use of co-occurrence data in information retrieval. J. Docum. 33, 106-119 (1977).
[12] G. Salton, E. A. Fox and H. Wu. Extended Boolean information retrieval. Commun. ACM 26(11), 1022-1036 (1983).
[13] G. Salton, E. A. Fox and H. Wu. An automatic environment for Boolean information retrieval. Information Processing 83: Proc. of the IFIP 9th World Computer Congress, Paris, France, pp. 755-762 (1983).
[14] G. Salton and E. Voorhees. Automatic assignment of soft Boolean operators. Proc. of the 8th Annual ACM-SIGIR Conf. on Research and Development in Information Retrieval, Montreal, Canada, pp. 54-69 (1985).
[15] J. Verhoeff, W. Goffman and J. Belzer. Inefficiency of the use of Boolean functions for information retrieval. Commun. ACM 4, 557-558, 594 (1961).
[16] P. V. Angione. On the equivalence of Boolean and weighted searching based on the convertibility of query forms. J. Am. Soc. Inform. Sci. 26, 112-124 (1975).
[17] S. K. M. Wong, W. Ziarko, V. V. Raghavan and P. C. N. Wong. On modeling of information retrieval concepts in vector spaces. ACM Trans. Database Syst. 12(2), 299-321 (1987).
[18] P. C. N. Wong. A Generalized Vector Space Model for information retrieval systems. M.Sc. thesis, Department of Computer Science, University of Regina, Regina, Saskatchewan (1985).
[19] J. L. Gersting. Mathematical Structures for Computer Science. Freeman, San Francisco (1982).
[20] D. Buell. A general model of query processing in information retrieval systems. Inform. Process. Mgmt. 17, 249-262 (1981).
[21] W. Waller and D. H. Kraft. A mathematical model of a weighted Boolean retrieval system. Inform. Process. Mgmt. 15, 235-245 (1979).
[22] D. J. Harper and C. J. van Rijsbergen. An evaluation of feedback in document retrieval using co-occurrence data. J. Docum. 34, 189-216 (1978).
[23] S. K. M. Wong and W. Ziarko. A unified model in information retrieval. Fundam. Inform. 10, 35-36 (1987).
[24] E. A. Fox. Characterization of two new experimental collections in computer and information science containing textual and bibliographic concepts. Technical Report TR-83-561, Department of Computer Science, Cornell University, Ithaca, N.Y. (1983).
[25] C. D. Paice. Soft evaluation of Boolean search queries in information retrieval systems. Inform. Technol. 3(1), 33-41 (1984).
[26] C. T. Yu, W. S. Luk and T. Y. Cheung. A statistical model of relevance feedback in information retrieval. J. ACM 23(2), 273-286 (1976).
[27] G. Salton, E. A. Fox and E. Voorhees. Advanced feedback methods in information retrieval. J. Am. Soc. Inform. Sci. 36(3), 200-210 (1985).
[28] M. Dillon and J. Desper. Automatic relevance feedback in Boolean retrieval systems. J. Docum. 36, 197-208 (1980).
[29] M. Dillon, J. Ulmschneider and J. Desper. A prevalence formula for automatic relevance feedback in Boolean systems. Inform. Process. Mgmt. 19(1), 27-36 (1983).
[30] G. Salton, E. A. Fox, C. Buckley and E. Voorhees. Boolean query formulation with relevance feedback. Technical Report TR-83-539, Department of Computer Science, Cornell University, Ithaca, N.Y. (1983).
[31] G. Salton, E. A. Fox and E. Voorhees. A comparison of two methods for Boolean query relevance feedback. Inform. Process. Mgmt. 20(5/6), 637-651 (1984).
[32] W. B. Croft. Experiments with representation in a document retrieval system. Inform. Technol. Res. Dev. 2(1), 1-21 (1983).
[33] S. E. Robertson, C. J. van Rijsbergen and M. F. Porter. Probabilistic models of indexing and searching. In Information Retrieval Research (Edited by R. N. Oddy et al.), pp. 35-56. Butterworth, London (1981).