Information Processing Letters 71 (1999) 29–34
Verbs are not cases: Applying case grammar to document retrieval

P.C. Chu
Department of Accounting and MIS, Fisher College of Business, The Ohio State University, 2100 Neil Avenue, Columbus, OH 43210, USA
Email: [email protected]

Received 2 September 1998; received in revised form 24 May 1999
Communicated by D. Gries
Keywords: Information retrieval; Case grammar; Relational database; Vector Space Model; Semantic Vector Space Model
1. Introduction

In the Vector Space Model (VSM) [4], the thematic roles of index terms are not represented. Consider a query about A suing B (e.g., DEC sued Intel). VSM would retrieve not only reports on A suing B, but also reports on B suing A. The latter is semantically opposite to the query. This happens because the model does not differentiate the role of A (the accuser) from that of B (the accused). Case grammar offers a way to represent thematic roles.
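To make this concrete, the following sketch (not from the paper) assumes a plain bag-of-words VSM with binary term weights and cosine similarity; under those assumptions the two reports receive identical vectors, so the model cannot prefer one over the other.

    # Minimal sketch: binary bag-of-words VSM, word order (and hence role) discarded.
    from math import sqrt

    def bag_of_words(text, vocabulary):
        """Binary term vector over a fixed vocabulary."""
        tokens = {t.lower() for t in text.split()}
        return [1 if term in tokens else 0 for term in vocabulary]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    vocab = ["dec", "sued", "intel"]
    query = bag_of_words("DEC sued Intel", vocab)
    doc_a = bag_of_words("DEC sued Intel", vocab)   # same semantics as the query
    doc_b = bag_of_words("Intel sued DEC", vocab)   # semantically opposite report

    print(cosine(query, doc_a))  # 1.0
    print(cosine(query, doc_b))  # 1.0 -- VSM cannot tell accuser from accused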
2. Case grammar

Case grammar aims at discovering the deep structures of human languages. In his famous paper "The Case for Case", Fillmore [1] asserts that "The sentence in its basic structure consists of a verb and one or more noun phrases, each associated with the verb in a particular case relationship". A sentence can be represented by:

V + C1 + C2 + · · · + Cn,

where V stands for the verb and Ci for case i. Cases, which are also called thematic roles, are types of human judgments about events, such as who did it, who it happened to, what got changed, etc. [1]. Some illustrative cases of a verb include:

Agentive (A): instigator of the action (specified by the verb);
Instrumental (I): inanimate force or object causally involved in the action;
Dative (D): animate being affected by the state or action;
Locative (L): spatial orientation of the state or action;
Objective (O): inanimate object affected by the action or state;
Benefactive (B): animate being or organization benefited by the action;
Quantitative (Q): number of objects affected by the action or state;
Temporal (T): time at which the action or state occurs.
In the sentence "Wellington defeated Napoleon at Waterloo", defeated is the verb, Wellington is an argument for case A, Napoleon for case D, and Waterloo for case L. In essence, cases provide additional information about noun phrases, which serve as arguments to the cases. This additional information sheds light on the roles assumed by the noun phrases. Therefore, the use of case grammar in IR has the potential to capture semantics not available with VSM. The set of cases associated with a verb is called its case frame. For a particular verb, some cases in its case frame are obligatory, others optional. Three features of case grammar are worth noting: (1) verbs, which describe an action or a state, are central in sentences, (2) verbs are NOT cases themselves, and (3) noun phrases, as arguments to cases, provide details about the action or state specified by the verb.
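As a small illustration (our own, not part of Fillmore's formalism), a sentence can be encoded as a verb together with a mapping from cases to noun-phrase arguments; the structure below sketches the Wellington example under that assumption.

    # Minimal sketch: a sentence as a verb plus case-to-argument bindings.
    sentence = {
        "verb": "defeat",
        "cases": {
            "agentive": "Wellington",   # instigator of the action
            "dative": "Napoleon",       # animate being affected by the action
            "locative": "Waterloo",     # spatial orientation of the action
        },
    }

    # The verb is kept outside the case bindings: cases describe aspects of
    # the action; they do not include the action itself.
    print(sentence["verb"], sentence["cases"]["agentive"])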
3. Prior applications of case grammar in IR

Several studies have insightfully applied case grammar to IR [3,5,6]. However, they all made the same conceptual error in their formulations. In a sentence represented by V + C1 + C2 + · · · + Cn, all of them [3,5,6] convert V into a case, so that the sentence becomes C1 + C2 + Cv + · · · + Cn, where Cv is the case category Action created for verbs. Thus, verbs become one of the cases, which can exceed 30 in number (e.g., see [3, p. 397]). This formulation preserves the vector structure of VSM and is aptly called the Semantic Vector Space Model (SVSM) [3]. This seemingly innocuous adaptation of case grammar to fit the mold of VSM has serious ramifications, both in theory and in practice.

Theoretically, verbs are NOT cases, as cases are about aspects of verbs. Treating verbs as cases thus violates the theory of case grammar. In practice, this violation dilutes the differentiating power of verbs. Since the action is but one of a large number of cases, a query may well retrieve documents of opposing meaning. This point is illustrated with a simple example. Consider two documents described by two sentences: "John saved Mary with a gun in Africa in 1997" and "John killed Henry with a rope in England in 1982". When verbs are treated as cases, we have the representation given in Table 1. For a query, say, about John killing someone with a gun in 1997, the similarity measures for the two documents differ: the first document has three matches in case values (John, gun, 1997), while the second has two (John and kill). Thus, the first document receives a higher similarity ranking than the second, despite the fact that the semantics of the first document is diametrically opposed to the query (save versus kill). This bizarre result is due to the dilution of the differentiating power of verbs; obviously, the greater the number of cases used, the greater the dilution.

The above illustration over-simplifies SVSM: a document representation in SVSM is derived by processing multiple sentences rather than a single sentence, and is more complex, with weights assigned to tokens (case arguments) and probability assessments of the tokens' case categories. Nevertheless, it serves to illustrate that treating verbs as cases is a conceptual flaw. In SVSM, unless the weight assigned to the verb as a token is much larger than the weights assigned to the other tokens (and there is no guarantee that it will be), the problem above is likely to occur.
Table 1
Agentive   Dative   Action   Instrumental   Locative   Temporal
John       Mary     save     gun            Africa     1997
John       Henry    kill     rope           England    1982
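The dilution effect can be reproduced with a toy calculation. The sketch below assumes an unweighted representation and uses a simple count of matching case values as the similarity measure, which, as discussed above, simplifies SVSM considerably.

    # Toy reproduction of the Table 1 example (assumption: unweighted case
    # vectors and a plain count of matching case values as the similarity).
    CASES = ["agentive", "dative", "action", "instrumental", "locative", "temporal"]

    doc_save = {"agentive": "John", "dative": "Mary", "action": "save",
                "instrumental": "gun", "locative": "Africa", "temporal": "1997"}
    doc_kill = {"agentive": "John", "dative": "Henry", "action": "kill",
                "instrumental": "rope", "locative": "England", "temporal": "1982"}
    query = {"agentive": "John", "action": "kill",
             "instrumental": "gun", "temporal": "1997"}

    def matches(query, doc):
        """Number of cases whose values coincide in query and document."""
        return sum(1 for c in CASES if c in query and query[c] == doc.get(c))

    print(matches(query, doc_save))  # 3 (John, gun, 1997)
    print(matches(query, doc_kill))  # 2 (John, kill)
    # The document about saving outranks the one about killing.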
4. A conceptually sound approach

We present a simple and theoretically sound approach to applying case grammar to IR. In a sentence represented by V + C1 + C2 + · · · + Cn, the key to our approach is to single out V and separate it from the cases associated with it. This concept is represented by V(C1 + [C2] + · · · + Cn), where brackets denote optional cases. For example, the general structure for sentences describing the event resign is represented as:

Resign(agentive + [objective] + [causative] + · · ·).

By substituting familiar terms for those used in case grammar, making a small change in notation, and adding a field DocID to identify documents, we have:

Resign(DocID, agent, [position], [cause], ...).

It is easy to see that this structure resembles a relational database schema. The word resign can be thought of as the name of a relation, whereas DocID, agent, position, and cause are names of attributes. A user's request for documents about resignations because of Watergate can be formulated as: Select from Resign where cause = "Watergate".

The relationship between a document and the sentence(s) describing it can be one-to-one or one-to-many, depending on the content of the document. If a document describes a single act, e.g., a resignation, one sentence is sufficient to represent it. On the other hand, if a document describes more than one act (e.g., the resignation of Rubin together with the nomination of Summers as Treasury Secretary), multiple sentences need to be employed to represent the document. In the latter case, in addition to being represented in the relation Resign, the document will be represented by a record in the relation Nominate, whose structure appears below:

Nominate(DocID, nominator, nominee, position, ...).

To summarize, our approach consists of the following simple steps:
(1) Single out the verb of a sentence that describes an act reported in the document.
(2) Identify arguments for the cases included in the case frame for that particular verb.
(3) Store the data obtained in step (2) as a record in the relation named by the verb.
(4) Repeat steps (1)–(3) until all the acts reported in the document are captured.

One result of this approach is that data about documents are stored in multiple logical files (at least one file for each verb) rather than in a single logical file as in VSM or SVSM.
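A minimal sketch of the resulting storage scheme, using SQLite for concreteness: the relation and attribute names follow the Resign and Nominate examples above, while the document IDs and rows are invented purely for illustration.

    import sqlite3

    # One relation per verb; each record describes one act reported in a document.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Resign (DocID TEXT, agent TEXT, position TEXT, cause TEXT)")
    con.execute("CREATE TABLE Nominate (DocID TEXT, nominator TEXT, nominee TEXT, position TEXT)")

    # Hypothetical documents: d1 reports a resignation; d2 reports a resignation
    # together with a nomination, so d2 appears in two relations.
    con.execute("INSERT INTO Resign VALUES ('d1', 'Nixon', 'President', 'Watergate')")
    con.execute("INSERT INTO Resign VALUES ('d2', 'Rubin', 'Treasury Secretary', NULL)")
    con.execute("INSERT INTO Nominate VALUES ('d2', 'Clinton', 'Summers', 'Treasury Secretary')")

    # "Documents about resignations because of Watergate":
    for (doc_id,) in con.execute("SELECT DocID FROM Resign WHERE cause = 'Watergate'"):
        print(doc_id)  # d1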
5. Benefits of the approach

The most important property of the approach is that it is theoretically sound. It is faithful to the theory of case grammar in maintaining the centrality of verbs by distinguishing verbs from cases. Being consistent with the theory, the approach prevents the bizarre result described in Section 3 (a document about saving being retrieved by a query about killing), because the search is conducted not in a single logical file, as in prior approaches, but in the relation named Kill.

Since files are structured as relations, the power of relational database technology can be brought to bear on the task of IR [2], particularly the ability to draw data from multiple files. This point is illustrated with an example. Fig. 1 presents three relations about tennis tournaments. The first relation contains the results of tennis matches, the second contains tournament site information, and the third contains information about stadiums, such as surface type. The structures of the three relations are:

Tennis Match(DocId, Winner, Loser, Tournament, Year, Scores),
Tournament Site(Tournament, Year, Stadium),
Stadium(Stadium, Surface).

By joining the first two relations, it is possible to determine the stadium in which a particular match occurred, since the site of a tournament may differ from year to year. By joining the resulting relation with the last relation, it is possible to find out the characteristics of a particular stadium. Thus, we can retrieve documents that report, say, on tennis matches that were played on carpet and in which Sampras was beaten.
Tennis match
DocId  Winner     Loser      Tournament       Year  Scores
1234   Agassi     Woodforde  U.S. Open        1997  6/2, 6/2, 6/4
3462   Arazi      Dreekmann  French Open      1997  6/2, 6/4, 6/2
1569   Becker     Goellner   Eurocard Open    1997  6/2, 6/4
3789   Chang      Goosens    Australian Open  1997  6/0, 6/3, 6/1
7612   Filippini  Woodruff   AT&T Challenge   1997  7/5, 3/6, 6/4
5558   Krajicek   Sampras    Eurocard Open    1997  6/4, 6/4
1289   Norman     Korda      AT&T Challenge   1997  6/4, 6/4
2154   Norman     Sampras    French Open      1997  6/2, 6/4, 2/6, 6/4
5288   Rios       Bruguera   U.S. Open        1997  7/5, 6/2, 6/4
3801   Sampras    Moya       Australian Open  1997  6/2, 6/3, 6/3
6138   Sampras    Pioline    Wimbledon        1997  6/4, 6/2, 6/4
7772   Stolle     Alvarez    Wimbledon        1997  6/4, 6/4, 6/4

Tournament site
Tournament       Year  Stadium
AT&T Challenge   1997  Atlanta Athlete Club
Australian Open  1997  Australian National Tennis Center
Eurocard Open    1997  Stuttgart Schleyerhalle
French Open      1997  Roland Garros
U.S. Open        1997  Arthur Ashe Stadium
Wimbledon        1997  All England Lawn Tennis Club

Stadium
Stadium                             Surface
All England Lawn Tennis Club        grass
Arthur Ashe Stadium                 hard
Atlanta Athlete Club                hard
Australian National Tennis Center   hard
Roland Garros                       clay
Stuttgart Schleyerhalle             carpet

Fig. 1. Three relations on tennis tournaments.
Document 5558 reports such an event: Krajicek defeated Sampras in the 1997 Eurocard Open on carpet. Significantly, information residing in a document (Tennis Match) can be related to information outside the document (Tournament Site and Stadium) to provide additional information.
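A sketch of these joins, again using SQLite: the schemas follow Fig. 1 (with underscores in place of spaces in relation names), and only the rows needed for the example are loaded.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Tennis_Match (DocId TEXT, Winner TEXT, Loser TEXT,
                                   Tournament TEXT, Year INTEGER, Scores TEXT);
        CREATE TABLE Tournament_Site (Tournament TEXT, Year INTEGER, Stadium TEXT);
        CREATE TABLE Stadium (Stadium TEXT, Surface TEXT);

        INSERT INTO Tennis_Match VALUES
            ('5558', 'Krajicek', 'Sampras', 'Eurocard Open', 1997, '6/4, 6/4'),
            ('2154', 'Norman', 'Sampras', 'French Open', 1997, '6/2, 6/4, 2/6, 6/4');
        INSERT INTO Tournament_Site VALUES
            ('Eurocard Open', 1997, 'Stuttgart Schleyerhalle'),
            ('French Open', 1997, 'Roland Garros');
        INSERT INTO Stadium VALUES
            ('Stuttgart Schleyerhalle', 'carpet'),
            ('Roland Garros', 'clay');
    """)

    # Documents reporting tennis matches played on carpet in which Sampras lost.
    rows = con.execute("""
        SELECT m.DocId
        FROM Tennis_Match m
        JOIN Tournament_Site t ON m.Tournament = t.Tournament AND m.Year = t.Year
        JOIN Stadium s ON t.Stadium = s.Stadium
        WHERE m.Loser = 'Sampras' AND s.Surface = 'carpet'
    """).fetchall()
    print(rows)  # [('5558',)]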
The approach can also capture additional semantics through its ability to represent generalization-specialization hierarchies. For example, the events described by the verb commit arson can be categorized into a generalization-specialization hierarchy by type of arson: vandalism, excitement, revenge, crime-concealment, profit, extremist, and serial. Each of these may be sub-classified, and so on. In this connection, the database technique of vertical partitioning can be used to break a large relation into many smaller ones, allowing a targeted search to be performed on a much smaller relation.
6. Conclusion

Prior attempts to apply case grammar to IR are flawed in treating verbs as cases. This paper points out this conceptual error and presents a theoretically sound approach that can benefit from the power of relational database technology.

The approach presented here is best suited for documents of certain types. In case grammar, verbs are central. As such, case grammar can be useful in representing documents that report on events and actions, such as news. In news reporting, the attention is on what, who, how, when, where, why, etc., which correspond to the thematic roles in case grammar. Furthermore, a news item has a narrow focus and can be represented by a small number of sentences. The essentials of an event are typically laid out in the first paragraph of a news report, which reduces the complexity of automatic parsing. Another type of document that can readily benefit from this approach is the report on scientific research studying relationships among factors. Typically in this type of work, the impact of certain factors on other factors is studied. The thematic roles of the factors studied can be defined as independent variables, dependent variables, contextual variables, etc. Other descriptions can be added, such as the type of the research (laboratory, field, survey, etc.). Currently, most journals ask authors to supply a list of keywords describing their articles. The application of case grammar suggests that it would be beneficial for authors to provide the thematic roles of these keywords as well. This can easily be done by the authors and would enhance the precision of search.

The approach described in this paper can be implemented with or without automatic parsing for case arguments. Automatic parsing, however, is attractive because it is potentially cost effective. The task of automatically determining cases with precision is challenging. Prior research on SVSM [3,6] has made good contributions in this regard and can serve as a foundation for future work. In essence, automatic parsing for cases involves three steps: (1) identify the verb in a sentence, (2) identify the noun phrases, and (3) determine the case relationships between the noun phrases and the verb. The third step is the most difficult. To assist, heuristic rules can be developed, and two types can be useful. The first type takes advantage of the verb-specific nature of case frames to guide the search for case arguments. For example, in a sentence with the verb resign, the noun phrase serving as the subject is in the agentive case; also, the instrumental case can be ruled out, because the act of resignation does not involve the use of tools. The second type of heuristic rule takes cues simultaneously from a preposition and the noun phrase following it. For example, if a noun phrase indicating a location is preceded by at, on, or in (e.g., at the corner, on the street, in Paris), the noun phrase is most likely an argument for the locative case. Where no reliable heuristics are available, a probability distribution of prepositional case-role realization [3] can be constructed to help determine case assignments.
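The flavor of such heuristics can be sketched as follows. This is a toy rule set under strong assumptions: the verb, subject, and prepositional phrases are assumed to be already identified, and a small hand-made gazetteer stands in for real location detection; it is not the procedure used in SVSM.

    # Toy heuristic case assignment (assumes the sentence is already parsed
    # into a verb, a subject, and prepositional phrases).
    KNOWN_LOCATIONS = {"Paris", "England", "Africa", "Waterloo"}  # stand-in gazetteer

    # Verb-specific knowledge from the case frame: resign has no instrumental case.
    CASE_FRAMES = {"resign": {"agentive", "objective", "causative", "locative", "temporal"}}

    def assign_cases(verb, subject, prep_phrases):
        """Return a case -> argument mapping using simple cues."""
        allowed = CASE_FRAMES.get(verb)
        cases = {"agentive": subject}  # rule of type 1: the subject of resign is agentive
        for prep, noun_phrase in prep_phrases:
            if prep in ("at", "on", "in") and noun_phrase in KNOWN_LOCATIONS:
                cases["locative"] = noun_phrase   # rule of type 2: location after at/on/in
            elif prep == "in" and noun_phrase.isdigit():
                cases["temporal"] = noun_phrase   # e.g., "in 1974"
        if allowed is not None:
            # Drop any case the verb's case frame does not admit.
            cases = {c: a for c, a in cases.items() if c in allowed}
        return cases

    print(assign_cases("resign", "Nixon", [("in", "1974")]))
    # {'agentive': 'Nixon', 'temporal': '1974'}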
To conclude, this paper identifies a conceptual flaw in prior research applying case grammar to IR and proposes a new approach. Future research needs to address implementation and evaluation issues.
Acknowledgement

We thank the anonymous referee for providing excellent comments, which substantially improved the paper.
References

[1] C.J. Fillmore, The case for case, in: E. Bach, R.T. Harms (Eds.), Universals in Linguistic Theory, Holt, Rinehart and Winston, New York, 1968, pp. 1–90.
[2] D.A. Grossman, O. Frieder, D.O. Holmes, D.C. Roberts, Integrating structured data and text: A relational approach, J. Amer. Soc. Inform. Sci. 48 (2) (1997) 122–132.
[3] G.Z. Liu, Semantic vector space model: Implementation and evaluation, J. Amer. Soc. Inform. Sci. 48 (5) (1997) 395–417.
[4] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[5] R.F.E. Sutcliffe, Distributed representation in a text based information retrieval system: A new way of using the vector space model, in: Proc. 14th ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 1991, pp. 123–132.
[6] E. Wendlandt, J.R. Driscoll, Incorporating a semantic analysis into a document retrieval strategy, in: Proc. 14th ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 1991, pp. 270–279.