Artificial Intelligence 65 (1994) 181-188 Elsevier
Book Review
Paul Jacobs, ed., Text-Based Intelligent Systems *

Peter Norvig

Sun Microsystems Laboratories, Two Elizabeth Drive, Chelmsford, MA 01824-4195, USA
Received December 1992; revised March 1993
Correspondence to: P. Norvig, Sun Microsystems Laboratories, Two Elizabeth Drive, Chelmsford, MA 01824-4195, USA. Telephone: (508) 442-0508. Fax: (508) 250-5067 (Sun internal mail stop: UCHL03-207). E-mail: [email protected].
* (Lawrence Erlbaum, Hillsdale, NJ, 1992); 281 pages, ISBN 0-8058-1189-3.
0004-3702/94/$07.00 © 1994 Elsevier Science Publishers B.V. All rights reserved.

Text-Based Intelligent Systems is the title of this collection of papers, of the symposium that gave rise to the collection, and of the emerging field that gave rise to the symposium. In brief, a text-based intelligent system (TBIS) is a computer system that can extract useful information from a large corpus of stored text. There is a terrific opportunity in this area, because recent advances in networking and mass-storage technology have brought access to gigabytes or even terabytes of text within easy reach of many. However, text does not become useful information until there is a way to manage it, to match the right answer with each user's query. The 1990s promise to be the first decade of the computer age in which the emphasis is on extracting information rather than producing it; in which the average computer user gets more out than he or she puts in. Text-based intelligent systems, in one form or another, will be needed to make this possible. Thus, the TBIS may turn out to be the 1990s' equivalent of the data base, spreadsheet, or word processor: an application that changes the face of the entire computer industry, opening up new markets and new opportunities.

Jacobs outlines four applications of TBISs while allowing the possibility of other applications:

• Text extraction systems take a free-format English (or other natural language) text and extract fields that can be put into a more structured data base. For example, from a financial newswire feed a TBIS might pick out all the stories about mergers and acquisitions and build a data base giving the date, price, and companies involved in each transaction.
• Automated indexing and hypertext systems allow a user to browse conveniently through a collection of documents, moving from one to another by keying on common terms.
• Summarization and abstracting programs attempt to find just the stories that contain important new information and present them in a condensed format. For example, a series of news stories covering an ongoing political event might repeat information mentioned in earlier stories. A summarizer would eliminate this duplication, even when it is not a word-for-word copy.
• Intelligent information retrieval systems retrieve documents that are related to the user's query even when they contain different words, and rule out documents that contain the same words as the query used in different ways.

There are multiple applications because there are different classes of users. For example, a scholar who presents an online library catalog with a query normally would like to see every article or book that is relevant to the topic. But a casual library user who wants to read a good mystery would be happier with one reply than with a thousand.

Historically, there have been two approaches to the task of obtaining information from text. The information retrieval (IR) community has taken the approach that most of the content of a text is embodied in the individual words, and that it therefore makes sense to concentrate on extracting as much information as possible just from looking at the probability of occurrence of individual words and perhaps word pairs.
They do not deny that information also resides in the syntax, semantics, and pragmatics of the text, but they have found that the currently available tools that use these techniques do not contribute much to overall performance, and they feel that it would take too much effort to beef these tools up. The methodology is to find a large corpus of documents and queries for those documents, and to run experiments to test which methods work best on the queries. The evaluation criteria have been recall (the percentage of relevant texts found) and precision (the percentage of texts found that are relevant).

The artificial intelligence/natural language (AI/NL) community has stressed the need for deep understanding of both the stored text and the user's query, under the assumption that arbitrarily complex inferences may be needed to connect the two. The technology of choice for AI/NL has been hand-coded grammars, lexicons, and knowledge bases. The methodology is to find a difficult or interesting paragraph or two, and to demonstrate in detail
all that is needed to completely understand the paragraph. The evaluation criterion has been theoretical elegance. A system need not be able to take an arbitrary text and perform correctly on it; it is considered a success if the system works on a handful of carefully chosen examples and if it can be argued that the system would also work on other examples, given the time to hand-code the necessary linguistic and semantic information.

As an example of the contrast between the two approaches, consider the following two query/text pairs:
Query: How can I keep other people from seeing my work?
Text: You can use crypt to encode the contents of confidential files.

Query: What does adjacentscreens do?
Text: adjacentscreens tells the window-driver's mouse-pointer tracking mechanism how to move between screens that contain windows.

In the first pair, the text is relevant to the query because "my work" can be "the contents of files", to "encode the contents" is one way to "keep other people from" understanding, and understanding is one sense of "seeing". AI/NL advocates point to examples like this as justifying the need for deep, detailed knowledge. IR people admit that they would probably miss this one, but they would encourage the user to rephrase the query, and would expect to be able to answer most of the queries after a few iterations.

In the second pair, the word "adjacentscreens" finds the right text without any need for parsing, inference, or anything else. IR advocates point to examples like this to justify the need for broad, word-based indexing. AI/NL people admit that they might miss this one if their lexicon did not contain "adjacentscreens", but they would point out that there is no reason why the lexicon could not in principle be extended to include such words.

The 1990 TBIS symposium (part of the AAAI Spring Symposium series) was the first opportunity for many of the participants in both communities to come face to face with the other camp. From Jacobs's point of view (coming from the AI/NL camp), the important thing to take from the IR camp is their experimental methodology. Science progresses by proposing hypotheses, running experiments to test them, and updating the reigning model to incorporate the winning hypotheses. DARPA (the Defense Advanced Research Projects Agency) appears to agree with this assessment. It has sponsored two series of conferences that are really competitions.
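The experimental methodology both camps are being pushed toward rests on the recall and precision measures defined earlier, which reduce to simple set arithmetic. A minimal sketch (the document IDs are invented for illustration):

```python
def recall_precision(retrieved, relevant):
    """Recall: fraction of the relevant texts that were found.
    Precision: fraction of the found texts that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return len(hits) / len(relevant), len(hits) / len(retrieved)

# A system returns 5 documents; 3 of the 4 truly relevant ones are among them:
r, p = recall_precision({"d1", "d2", "d3", "d8", "d9"},
                        {"d1", "d2", "d3", "d4"})
# recall = 3/4 = 0.75, precision = 3/5 = 0.6
```

The two measures trade off against each other: returning every document in the collection gives perfect recall and terrible precision.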
In the Message Understanding Conferences (MUC), contestant programs read paragraph-length messages and performed the text extraction task, recording who did what to whom, and when. The programs were graded on how many fields were filled in correctly or incorrectly. There have been three results
of MUC:

(1) Universities and industrial research labs that previously had only "toy" implementations have built substantial systems.
(2) The scores have improved at each successive MUC.
(3) Researchers have spent more of their time on implementation details and less on theoretical musing.

The first two of these are clearly good results; the third is more controversial. If the current systems are on the right path, then the DARPA-imposed strategy is a good one. But it is possible that the current systems lead to a dead end, and small modifications will not break out of the local maximum.

DARPA has also sponsored the Text Retrieval Conference (TREC), a competition with the same format as MUC but aimed at traditional IR problems. Novel IR techniques, including techniques from AI/NL, are explicitly encouraged. So DARPA is interested in sponsoring both camps and seeing an exchange of ideas and methodologies, but is not attempting to force a marriage of the two.

Most authors in the TBIS book are in general agreement that such a marriage is a good thing, but there is some variation in the views. Some (e.g., Wilks,¹ Stanfill, Maarek, Lewis, Croft) argue that the use of better statistical or probabilistic techniques is most important. IR advanced significantly when the old strategy of combining terms with boolean operators (ands and ors) was replaced with the new idea of combining terms with a weighted sum. Perhaps another way of combining terms will lead to another advance. It is interesting that Croft, coming from the IR camp, advocates the same Bayesian inference network approach that is now becoming popular as a principled representation for knowledge-based systems in "traditional" expert systems/AI. Stanfill also points out that the field of statistics is changing to accommodate more dynamic models.
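The move from boolean combination to weighted sums can be made concrete with a small sketch. The tf·idf-style weight used here is one common illustrative choice, not the specific formula of any system in the book:

```python
import math

def score(query_terms, doc, collection):
    """Weighted-sum retrieval: instead of requiring a boolean
    AND/OR match, sum a weight for each query term the document
    contains. The weight here is term frequency in the document
    times the log inverse document frequency of the term."""
    n = len(collection)
    total = 0.0
    for term in query_terms:
        tf = doc.count(term)                          # occurrences in this document
        df = sum(1 for d in collection if term in d)  # documents containing the term
        if tf and df:
            total += tf * math.log(n / df)
    return total

docs = [["stock", "merger", "price"],
        ["merger", "acquisition", "price", "price"],
        ["weather", "report"]]
query = ["merger", "price"]
ranked = sorted(docs, key=lambda d: score(query, d, docs), reverse=True)
# The second document mentions "price" twice, so it ranks first;
# the weather story, sharing no query terms, scores zero.
```

A boolean AND would have treated the first two documents identically and rejected nothing by degree; the weighted sum produces a graded ranking instead.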
Some of the authors with hard experience in the IR trenches (Salton, Sparck Jones) warn that before adding an NL technique to an existing IR system, the designer must make sure that the addition will do some good. For example, IR systems of the past used surrogates (abstracts and/or lists of keywords) to represent a document. Now that memory is cheap, the entire document can be kept online to represent itself. But more is not always better: using the entire document may improve recall, but it can make precision worse, because documents that mention a term only in passing may be spuriously retrieved.

One disconcerting result is that a wide variety of different approaches to IR, some encompassing AI/NL and some not, have all resulted in about the same actual system performance. The fact that it is as rare to come up with a technique that is significantly worse than the state of the art as it is to come up with one that is significantly
¹ Only the names of the first authors are used in the body of this review. A complete list of authors and titles appears in the appendix.
better is an indication that we don't yet understand the problem well enough to see what is really going on.

Despite the promise of the field, a careful reading reveals that the reason the field is called "emerging" is that it has not yet emerged. This book demonstrates the clear advantage of marrying NL and IR techniques, discusses a few prototypes, and alludes to the surge in interest from DARPA and commercial concerns. Still, there are no convincing success stories to date. In short, the reader of this book who expects to see fielded applications and commercial successes will be disappointed. But the reader who wants to get in on the ground floor of an exciting, fast-growing field will be informed, challenged, and, I think, pleased with the selections in this book.
Individual papers

Part I
The first part of the book covers AI/NL work that has been extended in the direction of robust coverage of large texts.

Hobbs describes SRI's entry in the MUC (Message Understanding Conference) competition. This system attempts to do what traditional AI/NL systems do: a complete linguistic and semantic analysis of each sentence. However, it is somewhat non-traditional in the attention it pays to limiting processing time and recovering from errors. It uses keyword-based statistical techniques to decide which sentences should be skipped altogether and which should be processed. It limits the search through the space of possible parses and inferences, exploring only the most promising ones. And it is able to extract information from sentence fragments when a complete sentence is either ungrammatical or outside the coverage of the system. Together, these techniques yield a system that is capable of deep, detailed analysis for sentences that are completely within its competence, but that degrades gracefully for sentences that are not. The paper gives a good feel for the problems and tradeoffs involved in message understanding, as well as for the special-purpose tricks. For example, in the MUC corpus on Latin American terrorist incidents, it was possible to distinguish between Hispanic surnames and unknown English words using a statistical model based on the frequency of occurrence of three-letter sequences.

McDonald presents a system for a similar task, extracting job change information from Wall Street Journal articles, using a different technique. Where Hobbs started with a general grammar of English and added special techniques and vocabulary suitable to the domain, McDonald starts with a grammar that is crafted specifically to pick out the names of people and companies and the relations between them. He recognizes that there will
be some input that cannot be parsed by his grammar, but expects that the information he wants to recover will all be in the portion that he can parse.

Hirst describes how a flexible representation scheme can make use of linguistic and semantic analysis when it is available but still represent some of the content of a text when the linguistic analysis fails. Like Hobbs, he recognizes that the linguistic analysis will inevitably fail; when it does, he attempts to mix together the unparsed words with the parsed semantic representations to yield a hybrid representation. Thus, a representation like Hirst's might be appropriate for a system that tried to combine the approaches of Hobbs and McDonald.

Wilks presents a programme for turning Longman's dictionary from a messy, unstructured text into a usable data base. Such a resource would be invaluable to the community, although Wilks points out some of the pitfalls in obtaining it. It is still unclear whether processing a dictionary like Longman's, which was intended for humans, will yield better results than trying to build a machine-tractable dictionary from scratch.

Part II
The second part covers state-of-the-art work in IR. Croft presents the best overview of the field, defining IR as text representation plus query representation plus comparison of the two. It is unfortunate that the IR model stops with the result of the retrieval query and does not consider the presentation of the result to the user. Nevertheless, Croft's model is quite general, encompassing boolean, cluster-based, probabilistic, and vector space approaches.

The other three papers show how the basic model can be extended. Sparck Jones explains how IR has switched from indexing short abstracts of documents to indexing their full texts. Lewis considers the problem of classifying documents into separate categories, and the implications of this for text representation. He discusses a representation using two-word phrases, which can be extracted from a text by a parser that can recognize combinations such as subject-verb, verb-object, adjective-noun, and so on. One might think that such a representation would yield a big improvement in retrieval, but Lewis reports that in fact it does not help much.

Finally, Salton shows how IR techniques can be used to introduce hypertext-like links between paragraphs of text. This addresses the problem of how best to present the results of a query to the user. The standard technique is to present a list of all the retrieved text items that match the query. Sometimes the list is sorted by likelihood of match, but it still requires the user to go down the list, deciding which texts are relevant and which are not. With Salton's approach the retrieved text items are arranged into a graph rather than a simple list. The user can navigate through the graph and
follow links to other passages that are similar to the current one.
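Salton's linking idea can be reduced to a toy sketch: represent each paragraph as a bag of words and link every pair whose similarity clears a threshold. The cosine measure and the 0.3 threshold below are illustrative assumptions, not the actual parameters of the paper:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two texts as bags of words."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def link(paragraphs, threshold=0.3):
    """Return hypertext-like links (index pairs) between every
    pair of sufficiently similar paragraphs."""
    return [(i, j)
            for i in range(len(paragraphs))
            for j in range(i + 1, len(paragraphs))
            if cosine(paragraphs[i], paragraphs[j]) > threshold]

paras = ["the merger was approved by the board",
         "the board discussed the merger price",
         "rain is expected tomorrow"]
# Only the two merger paragraphs share enough vocabulary to be linked.
```

The resulting index pairs form the edges of the graph that the user browses, in place of a flat ranked list.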
Part III

Part III purports to cover applications, but in fact most of the systems discussed here are still in the prototype stage. Stanfill and Maarek report on systems that use traditional IR techniques to perform non-traditional tasks, while Hayes and Hearst present hybrid systems combining shallow AI/NL and IR. It is interesting to contrast the approaches to knowledge-based systems taken by the authors.

Hayes' Text Categorization Shell is a system that makes it possible to add domain-specific knowledge economically to the text classification problem, while falling back on weaker methods when the knowledge is absent. The idea is that knowledge should be used when appropriate.

Hearst tries to tackle some deep problems in semantic analysis having to do with metaphor interpretation. The aim is to determine not only the subject matter of a text, but also the directional stance it takes. For example, the sentence "The senator proposed lifting the ban on wastewater dumping" has a positive stance towards dumping, while if the word lifting were replaced with supporting, the directional stance would be negative. This is curious, because lift and support are not normally antonyms. Understanding the difference would thus appear to require sophisticated context-specific linguistic analysis, but Hearst shows how a general model of force and directionality, combined with partial parsing and statistical techniques, can perform the task with high reliability. The trick is that she is only trying to extract one bit of information (is this text positive or negative?), and an incomplete theory of the kind she describes is sufficient to capture that one bit most of the time.

Maarek's help system "explains things without understanding them." To him, knowledge should be avoided because it is too expensive. Finally, Stanfill relies on statistical knowledge to do text retrieval and classification.
To him the statistics embody the knowledge; the old hand-coded knowledge rules were just an approximation to the underlying Bayesian truth.
Appendix A. Table of Contents

1. Introduction: Text Power and Intelligent Systems -- Paul Jacobs
Part I: Broad-Scale NLP

2. Robust Processing of Real-World Natural-Language Texts -- Jerry Hobbs, Douglas Appelt, John Bear, Mabry Tyson, David Magerman
3. Combining Weak Methods in Large-Scale Text Processing -- Yorick Wilks, Louise Guthrie, Joe Guthrie, Jim Cowie
4. Mixed-Depth Representations for Natural Language Text -- Graeme Hirst, Mark Ryan
5. Robust Partial-Parsing through Incremental, Multi-Algorithm Processing -- David McDonald
6. Corpus-Based Thematic Analysis -- Uri Zernik
Part II: "Traditional" Information Retrieval

7. Text Retrieval and Inference -- Bruce Croft, Howard Turtle
8. Assumptions and Issues in Text-Based Retrieval -- Karen Sparck Jones
9. Text Representation for Intelligent Text Retrieval: A Classification-Oriented View -- David Lewis
10. Automatic Text Structuring Experiments -- Gerald Salton, Chris Buckley
Part III: Emerging Applications

11. Statistical Methods, Artificial Intelligence, and Information Retrieval -- Craig Stanfill, David Waltz
12. Intelligent High-Volume Text Processing Using Shallow, Domain-Specific Techniques -- Philip J. Hayes
13. Automatically Constructing Simple Help Systems from Natural Language Documentation -- Yoelle Maarek
14. Direction-Based Text Interpretation as an Information Access Refinement -- Marti Hearst