Automated Analysis of Instructional Text

Lewis M. Norton

Laboratory of Statistical and Mathematical Methodology, Division of Computer Research and Technology, National Institutes of Health, Bethesda, MD 20205, U.S.A.
Recommended by Donald Walker
ABSTRACT

The development of a capability for automated processing of natural language text is a long-range goal of artificial intelligence. This paper discusses an investigation into the issues involved in the comprehension of descriptive, as opposed to illustrative, textual material. The comprehension process is viewed as the conversion of knowledge from one representation into another. The proposed target representation consists of statements of the PROLOG language, which can be interpreted both declaratively and procedurally, much like production rules. A computer program has been written to model in detail some ideas about this process. The program successfully analyzes several heavily edited paragraphs adapted from an elementary textbook on programming, automatically synthesizing as a result of the analysis a working PROLOG program which, when executed, can parse and interpret LET commands in the BASIC language. The paper discusses the motivations and philosophy of the project, the many kinds of prerequisite knowledge which are necessary, and the structure of the text analysis program. A sentence-by-sentence account of the analysis of the sample text is presented, describing the syntactic and semantic processing which is involved. The paper closes with a discussion of lessons learned from the project, possible alternative approaches, and possible extensions for future work. The entire project is presented as illustrative of the nature and complexity of the text analysis process, rather than as providing definitive or optimal solutions to any aspects of the task.
1. Introduction

One of the long-range goals of artificial intelligence and computational linguistics is to provide computers with the ability to read books and to assimilate the knowledge contained within them. Lederberg [1] has stated, while discussing the difficulty of accumulating the expertise used by the knowledge-based programs in the SUMEX-AIM endeavor: "We're nearing the time when textbooks will be read and interpreted by machines." A more extensive reference to the same idea comes from Larkin et al. [2].
"A beginning physics student listens to lectures, studies a textbook, and works problems. One avenue to an understanding of learning processes would be to write programs that would allow a computer to read textbooks ... and work problems and thus reach some level of skill and knowledge in physics. Some hypotheses are necessary to account for the shape the learning program is to take. The most promising candidates are adaptive production systems ..."
Larkin et al. go on to emphasize learning from the examples which appear in a textbook, and from the working of further examples presented in the form of exercises. In contrast, this paper emphasizes mastering the facts stated declaratively in the body of a textbook, i.e., appropriating the knowledge set down by the author(s) for the student to make his/her own.

In the artificial intelligence literature, there is a relatively large amount of discussion of 'learning by example', or induction, including methods whereby such a learning program may modify its acquired productions as indicated by trial and error. There is, however, much less discussion of 'learning' via the 'comprehension' of prose; indeed, it is fair to argue that such a capability is not 'learning' at all, but merely(?) the conversion of knowledge from one representation to another. Be that as it may, the goal is the development of a computer system which can acquire a knowledge base which can be used by it or by another computer system.

The quotation from Lederberg notwithstanding, such a program for reading textbooks is not just around the corner. After all, textbooks are written by people for people, and use an overwhelming number of stylistic and pedagogical techniques in the quest for attractiveness and effectiveness. Authors vary their style in order not to be dry and boring; they appeal to the students' broad experience by the use of analogy and touches of humor. All of these devices, which by and large improve a textbook, are of no particular help to a computer program analyzing the text; in fact they are a definite hindrance. The program would need a very extensive syntactic capability in order to be able to process all of the grammatical constructions used in almost any textbook. Successful processing of illustration by analogy would require a broad general knowledge base, while the task is to acquire a specialized knowledge base. Conceivably in the future textbooks might be written for both people and computers, but at best that is only a distant possibility today. More realistically, one could take prose of today and edit it, paraphrasing complex and unusual constructions, deleting analogies (particularly if the information conveyed by them is redundant, serving only to emphasize a previously stated point), and making other revisions with the capabilities of a program in mind. Surely such a process could be simpler than the lengthy interaction between expert and computer programmer which presently is the usual means of eliciting the subject matter for knowledge-based systems.

Thus, it is not necessary to prepare an exhaustive grammar of English, or to solve what sometimes is glibly called 'the artificial intelligence problem', in
order to make meaningful progress toward the goal of reading and interpreting textbooks by computer. In fact, it is appropriate initially to make rather severe restrictions on the natural language input (the edited version of the original textbook) in order to explore the process of conversion to a new representation of knowledge that is more useful for the computer. In other words, the syntax, but not the semantic content, of the input text should be limited, in order to facilitate the discovery of methodologies for manipulating the information embodied in the text.

One of the most important decisions to be made is the choice of the target representation of the knowledge. On this point we are in essential agreement with Larkin et al. [2]; the most promising approach appears to involve some sort of production system. We propose to represent knowledge using statements of the programming language PROLOG [3]. PROLOG statements (also called rules or clauses) are, in general, implications in first order predicate calculus (simple facts are represented as implications with null antecedents), and have an obvious interpretation as productions. In addition, because of the dual interpretations of logic statements as declarative or procedural (see, e.g., [4]), there is an important advantage to using PROLOG statements: knowledge which is essentially declarative in its natural language formulation can be converted to a declarative logical representation, and without further conversion it will be in a procedural representation. The process alluded to in the phrase 'adaptive production system' becomes the process of augmenting and refining a PROLOG program, somewhat in the spirit of automatic programming.
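As a minimal illustration of this duality (our example; it does not come from the analyzed text), consider the following clauses:

    % A simple fact: an implication with a null antecedent.
    parent(ann, bob).
    parent(bob, carol).

    % Read declaratively: "X is a grandparent of Z if X is a parent of
    % some Y and Y is a parent of Z".  Read procedurally: "to establish
    % grandparent(X, Z), first establish parent(X, Y), then parent(Y, Z)".
    grandparent(X, Z) :- parent(X, Y), parent(Y, Z).

The same clause thus serves both as a stored assertion and as an executable procedure, which is the property exploited throughout this project.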
The purpose of the project being reported in this paper is to investigate in detail what would be involved in analyzing a sample text. Our approach was to select a passage from a textbook on programming, and to consider how its content could be transformed from its original English-language presentation into a representation as executable PROLOG statements. We relied mainly on introspection to guide the creation of a text analysis program which provided a concrete model of our ideas. We certainly do not claim that any of the methods incorporated in our analysis program are optimal solutions to any of the subproblems involved in text analysis. We do hope that our program considered as a whole gives an accurate illustration of the kinds and complexity of the methods that will be required for this endeavor.

Paradoxically, the creation of an actual operational text analysis program at this time is both necessary and premature. It is necessary to ensure that the various procedures which we consider actually accomplish what they are intended to accomplish, and that they are consistent with each other. The complexity of the text analysis task makes it important to verify effectiveness and consistency by a computer model which can be executed. This is important even when only a fraction of the methods that will ultimately be required is being considered. Without such a model we become bogged down in detail, overwhelmed by complexity, and we find ourselves unable to gain confidence in our proposals or even to consider them at the requisite level of detail. Yet certainly it is premature to write a text analysis program at this time. We know so little; the task is so complex. Our program can scarcely be expected to stand the test of time, our methods surely will have to be generalized many times, and we must be careful not to leap to the conclusion that even a portion of a working program provides a definitive answer to a problem of understanding or simulating a cognitive process.

Even in the light of the above warnings and in the light of the many compromises which will become apparent in the ensuing discussion, we feel that our program, and the project as a whole, are a substantive contribution toward the goal of automated text processing, if only because this is the first effort (that we know of, anyway) to work through in detail, at the level of a working computer program, the processing of a significant amount of textual material to obtain an executable representation of its content. (However, Silva and Dwiggins [5] report on another project using a PROLOG program to process multiple sentences of text. Their program converts the content of input text into non-executable data structures which are similar to instantiations of frames or scripts.)

The text which we have considered concerns programming in the BASIC language. The original material was taken from an early BASIC instructional textbook [6]. We wanted material which contained knowledge of a procedural nature, and which was both fairly elementary and reasonably self-contained. This makes it easier to demonstrate the level of 'comprehension' attained by the text analysis program. A single introductory chapter on programming in BASIC which discusses only the LET, PRINT, GOTO, READ and DATA statements contains enough information to enable one to write a working BASIC program, albeit a very simple one. Successful analysis of the entire chapter would result in the text analysis program acquiring enough information to write a tiny program in BASIC.

We have chosen not only to convert the material in the text into PROLOG statements, but to write the text analysis program itself in PROLOG. When reading the rest of this paper, it will be helpful for the reader to realize that because of the subject matter of the text we analyzed, the text analysis program we have written is a PROLOG program which processes text and itself writes a PROLOG program which, when executed, will deal with (parse, interpret, or write) simple BASIC programs.

It is of interest that the task of text analysis and the particular subject matter we have chosen, the task of programming, share an involvement with both syntax and semantics. Dealing with the semantics of the text necessitates analyzing the syntax in which it is presented. The subject matter being communicated covers both the syntax and semantics of BASIC programs and their subcomponents. Furthermore, syntax and semantics are not dealt with separately. Rather, the connections between them must be elucidated during the text analysis process.
2. Prerequisite Knowledge

Of course, there is a vast amount of knowledge required to read a text on any subject. This knowledge is of various types; here we mention three. First, there is subject-related background knowledge which the reader is expected to possess. For instance, the discussion of expressions in the BASIC textbook assumes that the reader knows some facts about algebra. On a much more elementary level, the reader is assumed to know what integers and letters are. There is a great deal of such background knowledge which is prerequisite for understanding a text. Indeed, it is hardly an exaggeration to say that one must know nearly everything about a subject before one can learn it!

For our text analysis program, most of the necessary subject-related background knowledge is in the form of pre-existing PROLOG rules. The text analysis program does not execute these rules, but they are available as data for reference during the text analysis process. The PROLOG rules which the text analysis program synthesizes are added to this knowledge base, and the combined sets of rules form the program which will parse, interpret, or compose BASIC statements.

A second type of prerequisite knowledge needed for reading and understanding any textbook is primarily linguistic. A partial listing of this linguistic knowledge includes lexical information about words, information about syntax and parsing, about morphology (affixes), about resolution of pronominal references, about focus of discourse (interpretation of context), etc. This type of knowledge is represented in two ways for the text analysis program. Much is embodied in the PROLOG rules forming the program itself, and some is encoded in a lexicon or table of words and information about them. An advantage of using PROLOG not mentioned earlier is that no parsing subroutine needs to be created; the PROLOG interpreter is itself the parser.

The third type of prerequisite knowledge is mostly subconscious for the human reader, though it is no less crucial. Somehow we store into our long-term memories the information we obtain from reading, in a form such that we can retrieve it and use it appropriately when we want. The counterpart to this knowledge essentially is those portions of the text analysis program which convert the output of syntactic and semantic analysis into PROLOG statements.

If one writes in PROLOG an interpreter for a subset of BASIC with a minimal diagnostic capability, and augments it with a rudimentary ability to synthesize BASIC statements, one has a program which does most of what the program which the text analysis process is to synthesize is able to do. Early in the project we did this; the PROLOG statements in the resulting interpreter could be partitioned into two classes. One class consisted of those statements representing knowledge mentioned nowhere in the text, and corresponded to the subject-related background knowledge discussed above. The other class contained the rules which wholly or in part represented knowledge in the text.
Naturally enough, since we did not create this interpreter by analyzing the text sentence by sentence, there was little correspondence between these rules and textual units of information (clauses, sentences, or sets of sentences). Clearly a text analysis program would not convert the information in the text, no matter how it was edited, into anything close to this form. However, the first class of statements served as a first approximation to the subject-related background knowledge needed to process the text, and the second class at least indicated the amount and nature of the knowledge which has to be converted into PROLOG rules by the text analysis process.
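For concreteness, the most elementary items of subject-related background knowledge might look like the following sketch (the predicate names and the character encoding are our assumptions, not the program's actual clauses):

    % A character is assumed to be represented as a one-character atom.
    letter(C) :- char_code(C, N),
                 ( N >= 0'A, N =< 0'Z ; N >= 0'a, N =< 0'z ).
    digit(C)  :- char_code(C, N), N >= 0'0, N =< 0'9.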
3. The Text Analysis Program

We first describe some of the general characteristics of the PROLOG program we have written to analyze text. It is designed to process text in sections; i.e., part of the editing process is assumed to be the division of the text into groups of one or more paragraphs of closely related semantic content. The program processes a section of text sentence by sentence. For each sentence, a lexical lookup procedure attempts to locate each of its words in a lexicon. Regular instances of the suffixes -s, -es, -ies, -ed, -ied and -ing are handled automatically. Words in the lexicon are assigned one or more parts-of-speech, and there is a list of features or attribute-value pairs for each word/part-of-speech combination. Fairly extensive manipulation of feature lists is carried out throughout the parsing activity.

The grammar which is used to parse sentences operates from left to right and is essentially predictive. Whenever a new word is about to be considered, a grammar rule initiates a request to see if the new word is in a given syntactic category. Along with the request it supplies the name of a procedure to be executed if the word may be used in that syntactic role. However, a specialized procedure may be associated with any word/part-of-speech combination in the lexicon via its feature list, and if such a lexically indicated procedure exists for a word, that procedure will be executed instead of the default procedure named by the most recently active grammar rule.

In the spirit of case grammars, verbs are of central importance. When a verb is expected, the null procedure is given as the default, i.e., each verb must have its own procedure associated with it via the feature list of its lexical entry. Thus, temporarily switching to ATN terminology, the parsing of the most prevalent sentence structure, NP + VP, proceeds by PUSHing for an NP, following standard arcs for modals, auxiliaries, and pre-verbal adverbial modifiers, and then following a lexically defined set of arcs for the main verb and all structures following it. The lexically indicated procedures for verbs indicate appropriate particles, how many NP's may follow the verb and what roles they play, and appropriate transformations, some of which involve the semantics of the verb.
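The shape of a lexicon entry might be pictured as follows (a hypothetical format; the paper does not reproduce the program's actual representation):

    % lex(Word, PartOfSpeech, FeatureList): one clause per
    % word/part-of-speech combination.  A 'procedure' feature names a
    % lexically indicated procedure overriding the grammar's default.
    lex(container, noun, [number-singular]).
    lex(provide,   verb, [procedure-parse_provide, particle-with]).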
The lexically indicated procedures, or parsing rules, associated with verbs not only specify the remaining syntactic and semantic analysis of a sentence, but also the synthesis of any appropriate PROLOG statements. That is, they also specify the change of representation of the information embodied in the analyzed sentence. In addition, they perform two other functions. One is to choose from among the nominal concepts appearing in the sentence (occasionally the concept may appear only implicitly) at least one which is recorded as a possible focus or topic for further discussion. The intent is for such distinguished concepts to be available as possible antecedents for subsequent anaphoric references. The other function is to return a 'summary' of the sentence in the form of a single PROLOG term. At the top level these terms are simply printed and then discarded, but in the case of embedded sentences they are available to higher-order procedures. Examples of the utility of these two functions will be seen in the detailed discussion of the operation of the program on particular sections of text.

An example of the use of a lexically indicated procedure for a word other than a verb is the following. When the program is parsing an NP, eventually it is ready to accept a noun, e.g., the NP so far might be 'a large', and the next word might be 'number'. The noun 'number' has its own procedure (which also would be associated with 'quantity', 'amount', and a few similar words), which checks if 'of' is the next word. If 'of' is not the next word, the lexically indicated procedure yields to the default procedure, but if 'of' is present, it treats 'a ... number of' as a quantifier and initiates a search for an appropriately restricted NP. Thus the phrase 'a number of containers' is never considered to be of the form NP + PP, but is transformed 'on the fly' into the equivalent of 'many containers'. (Readers familiar with PARSIFAL [7] will note its influence in our treatment of 'a number of'.)
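A minimal DCG sketch of this treatment (our rendering, with a toy two-word lexicon; the program's rules are more elaborate) might read:

    % 'a ... number of Xs' is consumed as a quantifier, so the phrase is
    % parsed as the equivalent of 'many Xs', never as NP + PP.
    np(quantified(many, NP)) --> [a], optional_adjective, [number, of], np(NP).
    np(simple(N))            --> [N], { noun(N) }.

    optional_adjective --> [].
    optional_adjective --> [Adj], { adjective(Adj) }.

    % Toy lexicon for the example:
    noun(containers).
    adjective(large).

With these rules, the query phrase(np(T), [a, large, number, of, containers], []) binds T to quantified(many, simple(containers)).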
4. The Text Being Analyzed

We have not yet completed our investigation of what is involved in analysis of the entire introductory chapter of the BASIC textbook. We have, however, considered four sections of the text, covering variables, constants, expressions, and the LET statement, i.e., we have progressed to the statement level rather than to the program level. The edited version of the text considered thus far reads as follows (capital letters denote the start of a new text section, as determined during the editing process).

    A. The computer is provided with a number of containers. Each of the containers can hold a number. Each container has a name. There are twenty-six containers with simple one-letter names. The remaining containers have two-character names. The first character must be a letter. The second character must be a digit. We often use the name of a container to indicate the number in it. We often refer to the name of a container as a variable. The value of a variable is the number in the container named by the variable.
    B. A constant is simply a number written in the program. A decimal point may or may not be included. Commas may not be included. A negative number must be preceded with a minus sign. A positive number need not be preceded by a plus sign.

    C. Expressions are formed according to the standard rules of arithmetic. Expressions may use variables as operands. Constants may be used as operands. Five basic operations are available:
        + addition
        - subtraction
        * multiplication
        / division
        ** exponentiation
    The fifth operator may also be a circumflex.

    D. The form of a LET command is
        LET variable = expression
    A LET command says simply
        1. Evaluate the expression
        2. Insert the value in the variable, throwing away any value currently in the variable.

These paragraphs were adapted by extensively editing material from [6, Ch. 2]. All illustrative examples were deleted, since induction is outside the domain of this project. Analogies were removed. Syntax was simplified by replacing most complex sentences with two or more simple sentences, and by removing certain constructions used only for style rather than content. Note that a variety of syntactic constructions still remain.

5. The Analysis of the Text

We now give a sentence-by-sentence account of the actions taken in analyzing the four sections of text. We orient the discussion in terms of the detailed operation of our program, but the reader should remember that the program is a model of our ideas on what might be involved in reading and understanding this material. In Appendix I we show, for those familiar with the PROLOG language, the rules synthesized by the program.

5.1. Analysis of Section A

The text analysis program needs to 'learn' from Section A that when writing a variable (in some BASIC statement, but that does not become evident until later in the text) one can use either a string of one character which must be a letter, or a two-character string consisting of a letter and a digit in that order. It is assumed that the specifications for letters and digits are already known. On a somewhat more abstract level, it is assumed that it is known that names (and, for Section B, numbers) are entities possessing representations as character strings, and therefore that it is necessary only to find out which particular character strings form a name (or number) of a given type.
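In DCG form, rules with this effect might read as follows (our rendering; Appendix I shows the rules the program actually synthesizes):

    % A computer_container_name is a single letter, or a letter followed
    % by a digit; letter/1 and digit/1 are assumed as background knowledge.
    computer_container_name([C])     --> [C],  { letter(C) }.
    computer_container_name([C1,C2]) --> [C1], { letter(C1) },
                                         [C2], { digit(C2) }.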
The original text used the term 'mailbox' instead of 'container' in this section, clearly an analogy of no use to the computer. We substituted the word 'container', inasmuch as the only properties of mailboxes that are relevant to the text are that they hold things, and that these things may be put in and taken out, i.e., the semantic import of 'mailbox' (or 'container') is simply the concept of a thing which can hold something else. The program must also 'learn' from this section that containers named by variables hold numbers, and that these numbers are the values of the variables. We now describe what happens during the processing of each of the ten sentences in the first section of text.

(A.1) The computer is provided with a number of containers.

The lexically indicated procedure for the verb 'provide' recognizes, among other structures, the pattern 'X is provided with Y', and transforms it into 'X has Y'. The result of processing 'X has Y' is to assert a concept of a particular kind of Y, namely one associated with an X. The asserted concept can serve as a topic or focus, thus establishing a context for further references to Y's. The first sentence, then, asserts the existence of a type of container possessed by a computer. The program achieves this via the assertion of a PROLOG unit clause, giving a name to this type of container (which happens to be 'computer_container', but could just as well have been 'X123'). The asserted concept has an associated feature list, on which is recorded, among other facts, the information that this type of container is associated with a computer. The asserted concept will remain active throughout the processing of this section of text. Incidentally, we have not included asserted concepts in Appendix I. Appendix I shows (with few exceptions) only those synthesized rules which, when executed, will deal with (parse, interpret, etc.) BASIC constructs, i.e., those rules which will be used when performing the task 'learned' from the text.

(A.2) Each of the containers can hold a number.

Similarly, the semantic content of 'X holds Y' is taken to be the assertion of a particular kind of Y, namely one in an X. 'Container' is interpreted in context; because of the previous sentence the program assumes that this sentence is referring to the particular kind of container mentioned earlier. Thus a second concept is asserted, namely a type of number called a 'number_in_computer_container', and this concept too can serve as a topic.

(A.3) Each container has a name.

'X has Y'. Now the concept of a 'computer_container_name' is asserted. Again, context is used to interpret 'container'.

(A.4) There are twenty-six containers with simple one-letter names.

First, a sentence of the form 'there are X with Y' is transformed into 'X have Y'. This sentence then is processed at two levels. At one level, this sentence repeats the content of the preceding one, i.e., without sentence (A.3) the text would still imply the existence of 'computer_container_names'. At the
second level, the appearances of the words 'letter' and 'names' cause some critical semantic processing. A lexically indicated procedure associated with the word 'letter' incorporates our knowledge that a string of one or more letters can be a representation of something. (The same procedure is associated with 'digit', and a very similar one is associated with 'character'.) Thus the presence of the descriptor 'one-letter' initiates the synthesis of a PROLOG rule for parsing a segment of a character string. In effect, the program knows that something (as yet unspecified) can be represented by a single letter, perhaps with further restrictions on the letter. The partially created PROLOG rule is saved on the feature list of the NP being parsed. In addition, the concept of a particular type of character, namely the first character in the (single-character) representation of a '?', is asserted. This concept can be referred to later, just as any other possible topic.

Not until the top level processing is completed (i.e., the processing of 'containers have names') can the program resolve the question of what it is that can be represented by a single letter, namely a 'computer_container_name'. This decision cannot be made when the NP 'simple one-letter names' is analyzed. The program can and does observe that 'names' must be the things with one-letter representations (since names are known via the lexicon to have representations as character strings). However, it realizes that 'names' must be interpreted in context. The concept of a 'computer_container_name' asserted as a result of processing the preceding sentence provides a possible interpretation, but in this situation the context must be taken from the rest of the sentence being analyzed. In fact, the program takes no notice of the previously asserted concept until the concept of a 'computer_container_name' is recreated from the current sentence. Then it observes that the two concepts are identical. Finally, the partially created PROLOG rule associated with the NP 'simple one-letter names' can be completed and asserted. Thus the program synthesizes a rule for parsing one form of 'computer_container_name'.

Note the different ways X and Y from the sentence pattern 'X has Y' are interpreted in context. As implied by the discussion to this point, X is interpreted by consulting topics arising from previous sentences. Then Y is interpreted by relating it to X. Obviously, lexically indicated procedures for verbs must treat different NP's in different ways, depending on the semantics of the verb.

The lexically indicated procedure associated with 'letter' causes the program to begin synthesizing a parsing rule, i.e., the presence of the word 'letter' in this sentence is taken to imply that the sentence is about the syntax of a 'computer_container_name' as opposed to its semantics. Of course, there has to be some way of relating syntax to semantics; there is an implicit assumption that parsing is performed for some purpose in addition to simply verifying that a given character string is acceptable according to some grammar rules. For this reason, each parsing rule is assumed to label its 'output' to indicate what it is
(e.g., a 'computer_container_name'). The particular way we have chosen to implement 'labelling' in PROLOG requires a companion assertion, which in this case says essentially "a parsed substring labelled with the atom 'computer_container_name' is a 'computer_container_name'". We stress that this second assertion is not a priori necessary, but is an artifact of the conventions we imposed on the PROLOG code being synthesized. Nearly all processing which synthesizes parsing rules also synthesizes such companion assertions.

(A.5) The remaining containers have two-character names.

The processing for this sentence is similar in many ways to that of the previous one. At one level the sentence implies (for the third time in the paragraph) the existence of 'computer_container_names'. At the second level, the words 'character' and 'names' trigger actions much like those taken in the analysis of the last sentence. However, when the procedure associated with the word 'character' is about to begin the synthesis of a new PROLOG rule and the assertion of two new types of characters (namely the first and second characters in the representation of some as yet unknown entity), it first must take some action because the context for 'character' has changed. The preceding concept of a type of character which is the first (and only) one in the 'computer_container_name' representation given in the previous sentence must have its status changed, because it may no longer be referred to except by a fully explicit description. This is accomplished by asserting a rule which says in effect that there are no further restrictions on such characters. Then the two new character concepts are asserted, and a partial rule is formulated saying in effect that something can be represented as two (otherwise unspecified) characters. When processing of the sentence is completed, the something becomes identified as a 'computer_container_name'. If the section of text ended here, the PROLOG rules synthesized would accept single letters and arbitrary strings of two characters as legal 'computer_container_names'.

(A.6) The first character must be a letter.
(A.7) The second character must be a digit.

These two sentences, both of the form 'X is Y', are straightforward to process. The topics 'first character' and 'second character' are readily interpreted in context, and PROLOG rules making the appropriate restrictions on the characters in the two-character form of a 'computer_container_name' are asserted.

(A.8) We often use the name of a container to indicate the number in it.

The procedure associated with the verb 'to use' recognizes a pattern 'X use NP to VP' for sufficiently unspecified X (e.g., 'people', 'we'), and transforms it into 'X VP using NP'; i.e., the sentence is transformed into "we indicate the number in it using the name of a container" (the program remembers that 'the name of a container' originally preceded 'the number in it'). A sentence of the
form 'X indicate Y using Z', again for sufficiently general X, is considered to be equivalent to the sentence form 'Z denotes Y', as long as Z can have a character string representation. Processing this latter sentence form causes the assertion of a PROLOG rule which may be paraphrased: "if something is a Z, then it denotes a Y". This is not a parsing rule; rather it embodies something of the semantics of Z's. We shall see below how and when rules such as this are used during the analysis of the rest of the text.

Z and Y must be interpreted in context, and in the order in which they appeared in the original sentence. 'The name of a container', where the context implies that 'containers' are 'computer_containers', clearly refers to the already-existing concept of a 'computer_container_name'. 'The number in it' is consistent with the already-existing concept of 'number_in_computer_container', and therefore is taken to be such by the program. This 'back-door' method of resolving pronominal reference avoids the problem, extremely difficult in general, of determining antecedents for pronouns solely on a sentence basis. Here from syntax alone we could not choose with certainty between 'name' and 'container' as the antecedent of 'it', and the use of semantics to help make the choice is tricky at best; 'names' have entities in them, namely types of characters (among which are digits, which form numbers), so presumably the pattern of syntactic construction would have to be considered along with the strong association of 'containers' and the things in them (or with the knowledge that things usually do not denote parts of themselves), before a resolution could be made on some preference basis. But we go beyond the sentence; there is an existing concept of a particular kind of number which was created when processing an earlier sentence, and nothing is incompatible with assuming that it is that kind of number which is being referred to here. (The preceding is not intended to be a substantial contribution to the problem of resolving anaphoric references. For an indication of the range of phenomena involved in this problem, see [8].)

(A.9) We often refer to the name of a container as a variable.

'X refer to Y as Z', for sufficiently general X (as in the preceding sentence), is taken to supply a definition (or at least a sufficient condition) for a Z. In this sentence Y, of course, is clearly a 'computer_container_name'. The program has no way of knowing whether or not the relationship between 'variable' and 'computer_container_name' will be useful in general or just in parsing. (It knows it might be useful in parsing because it knows that a 'computer_container_name' is a kind of name, and that names have representations as character strings and therefore can be parsed.) Therefore it asserts two rules, 'just in case'. One says essentially "a 'computer_container_name' is a variable", and the other says "to parse a variable, parse a 'computer_container_name'". (These rules have different forms in PROLOG; the first is a simple PROLOG clause, the second is a so-called definite clause grammar rule.)
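In PROLOG these two rules might be rendered as follows (our notation):

    % "a 'computer_container_name' is a variable": a simple clause.
    variable(X) :- computer_container_name(X).

    % "to parse a variable, parse a 'computer_container_name'": a
    % definite clause grammar rule.  (The DCG rule translates to a
    % clause of a different arity, so the two coexist.)
    variable(X) --> computer_container_name(X).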
The program also updates the lexicon, so that henceforth 'variables' are known lexically as things which can be parsed. Finally, the program propagates the denotation asserted when processing the preceding sentence, i.e., since 'computer_container_names' are known to denote something, then (at least some) variables are now known to denote the same thing.

A point to note is that 'variable' is a text-supplied concept name, not an internally generated one. While we have tried to provide for 'meaningful' internally generated concept names, there is neither assurance nor necessity for doing so; the feature lists of concepts are used to record information about the concepts. For the first time, a rule has been synthesized whose consequent (or head goal) employs a word used in the text. Such a rule can be invoked using the vocabulary of the subject material being analyzed. After processing this sentence, sufficient PROLOG rules have been synthesized to cover the syntax of variables, i.e., enough rules have been synthesized to parse variables.

Though it was not mentioned earlier, many of the PROLOG rules synthesized by the text analysis program have been explicitly restricted to apply only to the overall subject matter of the text. We declared the subject matter of this text to be 'BASIC'. Strictly speaking, then, the rules resulting from processing this section refer to BASIC variables only. The importance of this restriction will become clearer as we discuss the analysis of the rest of the sections of the text.

(A.10) The value of a variable is the number in the container named by the variable.

This sentence, like sentences (A.6) and (A.7), is of the form 'X is Y'. However, processing it results in many more actions being taken, actions illustrative of the complexity of text analysis. First, the program must attempt to interpret X and Y in context. Here X is a noun modified by a prepositional phrase. When processing earlier sentences with this construction there had been concepts asserted for the head noun. However, that is not the case for the head noun 'value'. (There have been no sentences saying "variables have values", for example.) Therefore the presence of the prepositional phrase introduced by the preposition 'of' causes the assertion of a concept of a 'variable_value', in a manner completely analogous to the concept assertion resulting from processing a sentence of the form 'X has Y'. The 'number in the container' is obviously a 'number_in_computer_container', and a rule is asserted stating "if Z is a variable, then the value of Z is the corresponding 'number_in_computer_container'". Unfortunately, the fact that numbers are known to have character string representations causes the assertion of a second rule saying "to parse a 'variable_value', parse a 'number_in_computer_container'". This rule presumably will never be invoked when the synthesized program is executed.

Note that the idea of denotation which comes from the lexically indicated procedure for 'indicate' is very close to the idea of 'value', though the program has no information to that effect. Yet the PROLOG rules synthesized for the denotation of a variable (after processing sentences (A.8) and (A.9)) and for the
value of a variable turn out to have the same antecedents (or body goals). These rules cover the semantics of variables.
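The semantic rule from (A.10) might be paraphrased in PROLOG as follows (the predicate names, and the shape of number_in_computer_container/2 as a relation between a variable and the number its container holds, are our assumptions):

    % "if Z is a variable, then the value of Z is the corresponding
    % number_in_computer_container".
    value(Z, N) :- variable(Z), number_in_computer_container(Z, N).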
5.2. Analysis of Section B

Analyzing the section about variables utilized some subject-related background knowledge, in particular the knowledge that letters, digits, or characters in general form representations for concepts, which can be parsed when they occur within character strings. This knowledge was represented, among other places, in procedures associated lexically with words like 'letter' and 'character'. This particular prerequisite knowledge can be said to be in some sense on a lower level than the subject matter of Section A. The situation is quite different for the next section, which is about another possible component of BASIC statements, namely constants. As will be apparent from the ensuing discussion, prerequisite knowledge plays a much greater role in the analysis of this section. The form in which we have cast the needed subject-related background knowledge is only one possible form. It is illustrative, but not definitive, and serves, as does much of this work, merely to illustrate the issues involved in analyzing text. For convenience, we repeat Section B.

    A constant is simply a number written in the program. A decimal point may or may not be included. Commas may not be included. A negative number must be preceded with a minus sign. A positive number need not be preceded by a plus sign.

(B.1) A constant is simply a number written in the program.

The analysis of this sentence does not involve any techniques which have not already been discussed. We note that a new section implies a fresh context, i.e., the concepts asserted while processing Section A are discarded. Thus 'number' is no longer interpreted as 'number_in_computer_container', as it would have been were this not a new section. The program asserts a rule stating that constants are numbers, and in addition, since numbers are known to have character string representations, the program asserts a rule stating "to parse a constant, parse a number". Also, the lexicon is updated to show that constants are now known to have character representations (cf. the discussion of sentence (A.9)). As mentioned in the discussion of lexically indicated procedures for verbs, a concept is identified as a possible topic for further discussion. We did not always mention these asserted concepts in the description of the processing of Section A, when they were not used in the subsequent analysis. As we shall see below, the concept identified while processing this sentence, namely 'number', is used during the analysis of the next sentence.

(B.2) A decimal point may or may not be included.

First, how is this sentence parsed? A lexically indicated procedure for 'decimal' checks to see if the next word is 'point' (or 'points'), and if so, treats
the word pair as a fixed phrase. Another lexically indicated procedure, this one associated with 'may', discards an ensuing 'or may not', on the assumption that it is redundant. Finally, the procedure for the verb 'include' recognizes the passive construction 'X is included in Y'. However, there is no 'in Y'. The program allows its absence to be an implicit 'in it', where the antecedent of 'it' must be able to have a character string representation. The program searches for such an antecedent among the concepts which have been asserted during the processing of this section, and finds a suitable candidate in the concept of 'number' which was asserted while processing (B.1). So (B.2) essentially is transformed into "a decimal_point may be included in a number".

But what is to be made of this? Surely the answer depends upon what is assumed as prerequisite knowledge about numbers. We have chosen to address this issue in the following way. Presumably a reader of this text has had varied experience in dealing with numbers. For instance, he or she may know about integers, decimals, fractions, maybe even complex numbers or numbers written in exponential notation. Also, presumably he or she has adopted a sort of defensive strategy, assuming that when the concept of number is introduced in a new context, it does not mean all possible types of numbers except those explicitly restricted, but instead it means something like "strings of digits, with perhaps some other characters allowed in the beginning of the character string (e.g., a sign), somewhere within the character string (e.g., a decimal point, a slash for a fraction, etc.), and at the end of the string (e.g., some representation of a power of ten, or a base)". He or she will know that there may be constraints on these extra characters, such as no more than one slash in a fraction. This is the sort of prerequisite knowledge we assume for (the syntax of) numbers, which reduces to knowing possible initial, intermediate, and final characters for the representation of a number, along with rules for parsing numbers based on such information. The 'working hypothesis' is that there are no known initial or final characters, and that one or more digits may be intermediate characters (i.e., a number must contain at least one digit, and there is no limit on how many digits).

Now we can describe what the procedure for 'include' must do to process this sentence. Since there is no qualifying phrase such as 'at the beginning' or 'after the digits', it must declare an additional legal intermediate character for numbers restricted to the context of BASIC, with an appropriate constraint. More precisely, 'X includes Y' (or 'Y is included in X', as we have here) is processed by declaring a Y to be an additional intermediate character for an X, subject to the restrictions that X must have a character string representation (as does a number) and that Y must be a type of character (as is a 'decimal_point'). The constraint is derived from the auxiliary (if present) and from features of the noun phrase Y. For (B.2) we have "a 'decimal_point' may ...", which becomes the constraint 'zero or one'. Some other possibilities are:

    "'decimal_points' may ..."   → zero or more,
    "'decimal_points' must ..."  → one or more,
    "a 'decimal_point' must ..." → exactly one.

The result of analyzing sentence (B.2) is the assertion of PROLOG rules embodying the above information, so that in effect the sentence conveys the message "zero or one 'decimal_points' appear in the character string representation of a number (in BASIC)". The program also asserts a concept of a particular type of character, namely a 'decimal_point' appearing in a number. We saw in the discussion of sentences (A.5), (A.6) and (A.7) how related character concepts were necessary. Here there is probably no need of the corresponding assertion, but during the analysis of this and the next three sentences, the program employs some of the same methodology as it did when processing sentences (A.4) and (A.5).
(B.3) Commas may not be included.

The analysis of this sentence proceeds in a manner identical to that of the preceding sentence, until the constraint is derived. 'May not' is taken to imply the constraint 'no more than zero', i.e., none. Thus (B.3) declares commas to be legal intermediate characters for numbers, but then declares that none may occur!

(B.4) A negative number must be preceded with a minus sign.
(B.5) A positive number need not be preceded by a plus sign.

The discussion of sentence (B.2) should have indicated the nature of the analysis of these two sentences. In a manner quite similar to the processing of 'X includes Y', 'Y precedes X' causes the declaration of a Y as a legal initial character for an X, given appropriate restrictions on X and Y. ('Need not' → 'may' by a lexically indicated procedure for 'need'.) We have assumed an item of subject-related background knowledge to the effect that there is no more than one (i.e., 0 or 1) initial character for a number. Therefore it is not necessary to derive constraints from (B.4) and (B.5). Note that the assumed constraint applies to the combined total of all types of legal initial characters (thus excluding '-+7'), as opposed to applying to one type of legal character as in (B.2) and (B.3). Note also that this approach implicitly requires the qualifier 'negative' in (B.4); a sentence such as "a number must be preceded with a ..." would conflict with the subject-related background knowledge we have assumed.

A shortcoming of our treatment of the syntax of number representations is revealed by these sentences. In effect the program and the code it synthesizes are oblivious of the information conveyed by the use of 'negative' and 'positive'. The synthesized code certainly can parse signed numbers, but it does not embody the information that a minus sign implies a negative number, or that otherwise a number is positive. To be candid, we have not yet considered how to use such information in the context of BASIC programming, and therefore we have deferred the problem of deciding how to have the text analysis process assert rules to that effect.

Note that there is no reference to the semantics of constants in this section, as opposed to the presence of material relating to the semantics of variables in Section A. Presumably the absence of such a reference is based upon the entirely reasonable assumption that the reader already possesses knowledge of the semantics of numbers. We have included PROLOG rules corresponding to such prerequisite knowledge. They say, essentially, that the value of a number is the number itself, and that the value of +n is the same as the value of n (necessary because PROLOG happens not to allow unary plus as an operator in integer arithmetic expressions).
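These prerequisite rules might be rendered as follows (a sketch under our naming conventions):

    % The value of a number is the number itself.
    value(N, N) :- number(N).

    % The value of +n is the same as the value of n.
    value(+(N), V) :- value(N, V).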
5.3. Analysis of Section C

The preceding two sections of text discuss concepts which can be components of BASIC expressions. Therefore Section C, which discusses expressions, represents a step upward in complexity from the point of view of BASIC. As will be seen, the analysis of this section, like that of Section B, relies heavily on subject-related background knowledge. Section C, in its edited form, reads:

    Expressions are formed according to the standard rules of arithmetic. Expressions may use variables as operands. Constants may be used as operands. Five basic operations are available:
        + addition
        - subtraction
        * multiplication
        / division
        ** exponentiation
    The fifth operator may also be a circumflex.

(C.1) Expressions are formed according to the standard rules of arithmetic.

In the course of processing this sentence, the following steps occur. When the lexically indicated procedure for the verb 'form' recognizes a construction in the passive voice, it expects to find an adverbial modifier of some kind which will describe how the syntactic subject of the sentence is formed. In this sentence, the words 'according to' are treated as a fixed phrase functioning as a preposition, so there is an adverbial prepositional phrase of the form "'according_to' ... rules ...". The noun 'rules' has a lexically indicated procedure which expects to find an adjectival modifier telling which rules (e.g., rules of chess, backgammon rules). In this case it is rules of arithmetic. The indicated rules are expected to be available as prerequisite knowledge, and to be accompanied by a few facts about themselves, i.e., each rule may have some associated features. (Because PROLOG does not make it as convenient as does LISP to deal with its rules as data, we have supplied these rules in a list form.) One feature which a rule may have is a list of the subcomponents of the rule,
whose significance will become clearer as the discussion continues. The rules are attached to the feature list of the NP functioning as the object of the preposition 'according_to' as the value of the feature 'denotation'. (The analysis of (A.8) and (A.9) involved the synthesis of rules about denotations; here we have a denotation appearing as a feature. Other than the fact that both of these phenomena deal with the representation of information about something denoting something else, there is little connection between the two. In particular, the feature is used at text analysis time, while the rules will be used when performing the task 'learned' from the text.)

Since all the syntactic and semantic expectations of the lexically indicated procedures have been fulfilled, the procedure for 'form' must proceed to deal with the sentence 'X is formed ...'. In the case that a denotation is associated (via a feature list) with an appropriate component of the adverbial modifier, every member (rule) of the denotation which mentions the concept X (i.e., X appears as a predicate name somewhere in the rule) is transformed from whatever kind of rule it is into a rule about the overall subject matter of the text. Here, an arithmetic rule becomes a rule about BASIC. The two rules mentioning 'expressions' become rules which can be paraphrased in English as follows.

(1) A BASIC expression is either (a) a BASIC operand plus a BASIC operator plus a BASIC expression (recursive case) or (b) simply a BASIC operand.
(2) A BASIC operand can be a BASIC expression enclosed in parentheses.

The first of these rules has as subcomponents BASIC operands and BASIC operators. The two rules are asserted, and in addition, the two classes of subcomponents are recorded as topics which need further elaboration.

If the above seems overly ad hoc, our only response is that all of the above or something similar to it must be done in response to the presentation of this input sentence. Perhaps in the light of many further examples some structure can be imposed upon the kinds of processing which are necessary for sentences like this. For now the analysis of (C.1) remains yet another illustration of the complexity and diversity of the methods by which the content of text is construed in the light of prerequisite knowledge.

(C.2) Expressions may use variables as operands.
(C.3) Constants may be used as operands.

In these two sentences the presence of the 'use(d) as X' construction, where X is an expected possible topic due to the previous sentence, straightforwardly results in the assertion of rules stating that to parse a BASIC operand, it suffices to parse a (BASIC) variable or a (BASIC) constant.

(C.4) Five basic operations are available:
    + addition
    - subtraction
    * multiplication
    / division
    ** exponentiation

This sentence illustrates the fact that text is not just a linear sequence of words and punctuation. We have assumed that the editing process can use fixed conventions to transform algorithmically such two-dimensional text layouts into list structures with no loss of information. This sentence is split into two parts: the sentence "Five basic operations are available", and a list of two-element lists, which the analysis process must interpret to be the five operations.

To process the initial sentence, there must be a lexically indicated procedure for 'available', as it plays a verb-like role in the semantics of the sentence. The procedure must resolve the question "available for what?". There were topics asserted in previous sentences, namely operands and operators, but neither provides a direct answer to the question. To make the required connection, there must be prerequisite knowledge, which we assume to be supplied lexically, to the effect that operations have associated names (e.g., 'addition') and operators (e.g., '+'). (Note that the name of a particular operation is distinct from its character representation, which is the operator.) When such knowledge is present, the procedure for 'available' can determine that this sentence relates to the previously introduced topic of 'BASIC operators', and can assert a rule stating that a BASIC operator may be any of five different operators. It can also assert the existence of five new concepts or topics, namely the first, second, ..., and fifth types of operators. Since (C.1) implied a need to gather more information about operators, the program assumes that the concept of an operation is of lesser importance, and rather than asserting five more concepts corresponding to each of the five operations, it asserts only one further concept for use as a future topic, namely the concept of the set of 'five operations'.

The list corresponding to the second part of (C.4) then can be seen to correspond to the most recently introduced topic. It has the same cardinality (five), and furthermore, each operation is expected to have at least two attributes, an associated name and operator. Which attribute is the name is determined by which has a lexical entry ('addition', not '+'). Then rules relating each type of operator with its character string representation can be synthesized and asserted.

(C.5) The fifth operator may also be a circumflex.

The 'fifth operator' is readily identified in the context created by analyzing the previous sentence. A circumflex is a type of character, so the appropriate parsing rule can be readily asserted. The processing is similar to that of (A.6) and (A.7). Minor differences in detail are due to the fact that the rule asserted here must be a definite clause grammar rule, while the earlier rules were regular PROLOG clauses. The lexical knowledge that operators have character string representations (and characters themselves do not) provides the basis for distinction.
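Pulling together rules (1) and (2) from (C.1), the operand rules from (C.2) and (C.3), and the operator inventory from (C.4) and (C.5), the resulting grammar might be rendered as follows (our DCG sketch over a list of character tokens; Appendix I shows the program's actual rules):

    basic_expression(exp(L, Op, R)) -->
        basic_operand(L), basic_operator(Op), basic_expression(R).
    basic_expression(exp(X)) --> basic_operand(X).

    basic_operand(paren(E)) --> ['('], basic_expression(E), [')'].
    basic_operand(V) --> variable(V).    % from (C.2)
    basic_operand(C) --> constant(C).    % from (C.3)

    basic_operator(addition)       --> ['+'].
    basic_operator(subtraction)    --> ['-'].
    basic_operator(multiplication) --> ['*'].
    basic_operator(division)       --> ['/'].
    basic_operator(exponentiation) --> ['*','*'].
    basic_operator(exponentiation) --> ['^'].   % from (C.5)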
As in Section B, there is no explicit discussion of semantics in this section. We assume that it is known via subject-related background knowledge how in general to evaluate expressions in terms of the values of their operands, i.e., we assume that rules embodying the semantics of expressions are available (a deliberately naive sketch of what such rules might look like is given at the end of this subsection). Of course, BASIC expressions have one more type of operand than do ordinary arithmetic expressions, namely variables. Section A provided the information needed to be able to evaluate variables. The obvious danger in the way this section was analyzed is that all of the procedures and methods used were developed knowing what they would have to do to analyze the section successfully. For instance, operations had to be known to have associated names and operators. As discussed in the introduction, it simply is not possible to consider large enough amounts of text to guarantee the generality of methods before seeing whether proposed methods for handling small amounts of text are sufficient to do even that.
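As promised above, here is one conceivable shape for the assumed evaluation rules. It is purely illustrative: the predicate apply_op is hypothetical, the sketch evaluates strictly right-to-left, and it ignores the precedence conventions which the rules actually supplied must observe.

% Illustrative only; follows the labelled parse structure
% [expression,O,OP,..E1] produced by the rule synthesized from (C.1).
value([expression,O],V) :- value(O,V).
value([expression,O,[operator,OP],..E1],V) :-
    value(O,V1), value([expression,..E1],V2), apply_op(OP,V1,V2,V).
apply_op('+',X,Y,Z) :- Z is X+Y.    % and similarly for the other operators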
5.4. Analysis of Section D

With the final section of text to be discussed in this paper, we move one level higher in the semantics of the BASIC language, namely to the statement level (without the line number). The text of Section D follows.

The form of a LET command is
    LET variable = expression
A LET command says simply
    1. Evaluate the expression
    2. Insert the value in the variable, throwing away any value currently in the variable
Again we assume algorithmic editing conventions for transforming the two-dimensional aspects of the text into some sort of information-preserving input structures. Below we show one possible version of such structures for each sentence. Note that we have also preserved information about the case and font of the input characters.

(D.1) The form of a LET command is (LET, (italics, variable), =, (italics, expression)).

This sentence is processed by means of a rule which recognizes the pattern 'X of Y is LIST'; the rule is applied only if X has a lexically supplied feature indicating that it can appear in this template. The parsing rules which handle the various uses of the copulative verb 'to be' check for this situation (and many others!). The word 'LET' in the noun phrase instantiating Y is observed to correspond to the string 'LET' in LIST (both are in upper case). The process of asserting a PROLOG rule for parsing a Y (here, a 'LET command') then is begun; the first step is to synthesize its body goals by analyzing LIST.
Members of LIST not in italics are assumed to be quoted character strings. Each italicized member of LIST is matched with previously existing rules to determine the proper form for the goal corresponding to it. The relevant previously existing rules needed to do this for this sentence were synthesized when analyzing (A.9) and (C.1). The result of analyzing sentence (D.1) is the assertion of a rule for the syntax of a LET command in the BASIC language; this rule embodies the fact that parsing a LET command results in obtaining parsed versions of a (BASIC) variable and a (BASIC) expression.

(D.2) A LET command says simply ({evaluate the expression}, {insert the value in the variable, throwing away any value currently in the variable}).

The analysis of this sentence is the culmination of the analysis of the four sections of text, and indeed the results (synthesized rules) obtained up to this point must dovetail with the results of analyzing this sentence. The course of the analysis is somewhat more complex than that of the previous sentences. The lexically indicated procedure for the verb 'say' recognizes the pattern 'X says LIST' where (i) a syntax (parsing) rule already exists for an X; and (ii) LIST is a list of sentences in the imperative mood. LIST is then processed, and the analysis of each imperative sentence is expected to contribute one or more goals for a rule expressing the semantics of an X. While processing LIST, the syntax rule for an X is consulted in order to identify the concepts appearing in the imperative sentences. We discuss the imperative sentences individually.

(D.2.1) Evaluate the expression.

Recall that the analysis of every sentence results not only in the assertion of zero or more rules and one or more concepts, but also in the return of a 'summary' of the sentence in the form of a PROLOG term. Thus a goal is provided 'automatically' by the normal processing of a sentence. As an example, the term yielded when analyzing (A.1) was 'has([computer,..FL1],[computer_container,..FL2])'. The notation is intended to convey the fact that the term contains not only the concept names but also their (final) feature lists. Considered in isolation, processing sentence (D.2.1) has the following consequences, all due to straightforward use, by the lexically indicated procedure for 'evaluate', of methods described previously. First, a concept of an 'expression_value' is asserted, as if the sentence had read "an expression has a value" or "...value of an expression...". Then, the term 'value([expression,..FL],V)' is yielded, where V is a (PROLOG) variable. This is a summary of the sentence taken out of context, embodying the idea "the result of evaluating the expression is V". Note that 'expression' is accompanied by its feature list. The higher-level processing of LIST, before proceeding to the next imperative sentence, attempts to interpret this term in context. It uses the syntax rule for a LET command to identify 'the expression' with one of the two subcomponents which will be isolated when parsing a LET command.
This identification makes the connection between the syntax and the semantics of an expression appearing as part of a LET command. The term is transformed into the desired goal by an appropriate substitution. The new goal, as well as the syntax rule, becomes available for reference during the analysis of subsequent imperative sentences.

(D.2.2) Insert the value in the variable, throwing away any value currently in the variable.

This imperative sentence is split into two imperative sentences.

(D.2.2.1) Insert the value in the variable.

The lexically indicated procedure for 'insert' recognizes the pattern 'insert X in Y'. The interpretation of 'value' is 'expression_value' because of the context which was set when analyzing (D.2.1). In contrast, no context (in the form of an asserted concept) was set for 'variable' while processing this section. It might be expected, therefore, that the analysis of this imperative sentence would return something like 'insert(expression_value,variable)'. However, the procedure for 'insert' embodies some knowledge about methods corresponding to procedures the human brain uses when it stores information in long-term memory. In particular, it knows that the way information is manipulated in PROLOG implies that 'inserting' involves an assertion of some kind, i.e., rather than the term 'insert(X,Y)', it is more appropriate to return something on the order of 'assert(functor(X,Y))' for some functor or predicate name. What the functor should be depends upon what it is that a Y denotes. For this sentence it is known, from processing Section A, that a variable denotes a 'number_in_computer_container'. This is one of the key points that makes the complete analysis 'hang together'. Because this connection can be made, the term 'storeassert(number_in_computer_container,bASIC,[variable,..FL1],[expression_value,..FL2],a)' actually is yielded. This is essentially a syntactic variant of the PROLOG goal 'asserta(number_in_computer_container(bASIC,variable,expression_value))', which we have defined and used because it is easier to work with, having only one level of function nesting. 'Variable' and 'expression_value' have associated with them the feature lists created during the parsing process. (The reason for using the form 'bASIC' is given in Appendix I.) The second part of (D.2.2) is now processed.

(D.2.2.2) Throw away any value currently in the variable.

The procedure for 'throw_away' is very similar to that for 'insert'. It recognizes forms like
    {throw away | discard | erase} X {from | in | out of} Y.
By methods identical to those used in the previous analysis, the term yielded is 'storetract(number_in_computer_container,bASIC,[variable,..FL1],[expression_value,..FL2])', which is essentially a syntactic variant of 'retractall(number_in_computer_container(bASIC,variable,expression_value))'. The higher-level processing of the list of imperative sentences now attempts to interpret these two terms in context. For both, 'the variable' is identified with the other subcomponent which will be isolated when parsing a LET command. In the first term, 'the expression_value' is identified with the value mentioned in the first goal synthesized from this sentence (the one from (D.2.1)). The reader may have wondered why we included the definite article in the paraphrases of the concepts above. The reason is that the information conveyed by the article remains available via the feature list, and is used in the subsequent identification process. In particular, the definite article results in a check to verify that the identification is uniquely determined, e.g., there is only one 'variable' mentioned in the syntax rule for a LET command. The concept 'any expression_value', which occurs in the second term, requires a different treatment. 'Any' is taken to mean 'no particular one', i.e., not one of any available matches. Instead, the concept is replaced by a free variable. Finally, the two goals resulting from the analysis of (D.2.2) are interchanged. It was necessary to analyze them in their given order because of the possibility of anaphoric reference in their English formulations. However, the interchange is made on the theory that there would have been two separate imperative sentences, rather than the construction actually used, if no interchange were in order. There being no more imperative sentences in LIST, the three goals synthesized become the body goals for the rule expressing the semantics of a LET command. It remains for the procedure for 'say' to complete the rule by synthesizing a head goal. 'Say' in this context is taken to refer to semantics, and it is known via a lexical feature that the semantics of a 'command' or a 'statement' involves execution. Therefore the pattern 'X says LIST' is transformed into a rule of the form "to execute an X, achieve the goals embodied in LIST". In particular, the end result of analyzing (D.2) is the assertion of a rule which says "to execute a LET command in the BASIC language, let EV be the value of the expression mentioned therein, retract any value of the variable mentioned therein, and assert that the value of that variable is now EV". Refer to Appendix I for the actual PROLOG formulations of this and the other rules synthesized by the text analysis process.
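For convenience, the rule in question, as given in Appendix I, is:

execute(bASIC,[lET,V,E]) :- value(E,EV),
    storetract(number_in_computer_container,bASIC,V,_),
    storeassert(number_in_computer_container,bASIC,V,EV,a).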
6. Discussion
The rules (given in Appendix I) synthesized by the text analysis process are assumed to be executed in the presence of rules embodying subject-related background knowledge. For this particular text, there are only a few classes of such prerequisite knowledge. At the lowest level, there must be rules expressing knowledge about the character set used in forming the character string representations of BASIC statements. These include rules for recognizing blanks, letters, digits, and special characters such as punctuation, signs, the circumflex, etc. Next, there must be the rules, already mentioned, embodying knowledge about parsing numbers. Lastly, there must be rules embodying knowledge of how to evaluate expressions. Remember that we have assumed that the reader knows how to parse and evaluate arithmetic expressions, and that a BASIC expression, once its operands are evaluated, is simply an arithmetic expression. Therefore we assume the presence of rules stating how to evaluate constants and (arithmetic) expressions, which involves (sub)rules for performing addition, subtraction, multiplication, division, and exponentiation with the usual conventions regarding precedence. One final rule is needed. Just as we needed lexical information stating that the semantics of a command involves execution, we need a rule stating that to process something, first parse it (process its syntax) and then act upon it (process its semantics). Here that rule is simply

docommand(S) :- command(CONTEXT,C,S,[ ]),execute(CONTEXT,C).

The input to this rule, S, is assumed to be a PROLOG string, i.e., a list of (ASCII codes for) characters. There is no explicit output; the result of doing the command will be reflected in one or more side-effects of the goal execute. Note that the rules needed to begin solving the two body goals of the docommand rule are precisely the two rules asserted as a result of processing Section D of the text. Where the docommand rule comes from (i.e., whether we are justified in assuming it as prerequisite knowledge) is a bit problematic, and we will return to this point later. The synthesized rules, in the presence of the other rules mentioned above, have been extensively tested, and work correctly. For instance, execution of the goals

docommand("LET A = 1").
docommand("LET B = A + 1").
docommand("LET C = 3").
docommand("LET V1 = (A + B + C)/2").

results in the assertions

number_in_computer_container(bASIC,[variable,A],1).
number_in_computer_container(bASIC,[variable,B],2).
number_in_computer_container(bASIC,[variable,C],3).
number_in_computer_container(bASIC,[variable,V1],3).

That is, the appropriate assignments are reflected in the assertions. (The intended interpretation of the assertion 'number_in_computer_container(bASIC,X,Y)' is "in the context of BASIC, Y is the number in the 'computer_container' named by X".) For those who might be interested, we mention some details pertaining to the task of testing the synthesized rules. Since PROLOG has no provisions for floating-point arithmetic, the rules which we supplied for evaluating expressions embody a combination of rational arithmetic and a scheme involving integers stored with implicit decimal-point indicators. Exponentiation is defined only for integral exponents. The exact form of these rules is, of course, irrelevant from the point of view of testing the synthesized rules. That the rules assumed to embody subject-related background knowledge concerning the evaluation of expressions probably do not accurately reflect the semantics of any implementation of BASIC points out a danger of the textbook's assuming consensus on the part of its readers concerning subject-related background knowledge. Alternatively, perhaps we are to put more stress on the word 'formed' in sentence (C.1), and take the text as informing its readers that the syntax of BASIC expressions is like that of arithmetic expressions, while saying essentially nothing about the semantics of BASIC expressions. Under this interpretation, we cannot predict from the text how BASIC will evaluate an expression, even though we know how to evaluate arithmetic expressions. Support for this interpretation is given by the beginner's surprise at finding that 2.0 * 2.0 may be 3.999999. Since the synthesized rules do, in fact, work correctly, this text analysis project has succeeded in working through in detail the processing of multiple paragraphs of textual material of an instructional nature. We have seen how the semantic content of the various sections of the text interrelates, as we have seen how the synthesized rules interact with each other and with rules embodying prerequisite knowledge. From this experiment, we have accumulated many observations for discussion, as well as a number of insights, on various levels, into the nature of the task of reading and interpreting an instructional text. Our primary observation concerns the great difficulty of the text analysis task. Even given the limits we set for the project--subject material of a procedural nature readily representable as production rules, restricted syntax, elimination of metaphor, analogy, etc., elimination of illustrative examples (which would necessitate techniques of induction)--we were able to progress as far as we did only with great difficulty. Partly this was due to limitations on the resources that were available for this project, but primarily it was due to the fact that nearly every sentence of the text required new techniques for its analysis. This complexity was expected, of course, but its presence makes it at least problematic whether the state of the art will support text analysis of multiple chapters of material at this level of detail in the near future.
The magnitude of the text analysis task cannot be overestimated. We do feel, as we have indicated throughout this paper, that the various techniques we have used for analysis are illustrative of the kinds of techniques that are required for this task. Though better techniques may differ in detail from those we have invented and used, their general orientation and level of difficulty probably will be similar to those of our techniques. It will be necessary to represent detailed, lexically accessible information for the syntax, semantics, and pragmatics of most words in the vocabulary of the text. Particularly for subject material of a procedural nature, it will be necessary to represent knowledge in some form of production rule. It will be necessary to have a large body of prerequisite knowledge available to support the analysis process. And for the foreseeable future, it will be necessary to edit manually the text to be analyzed. We now discuss in more detail the effectiveness and appropriateness of particular methodologies that we used. PROLOG was an excellent choice for the programming language in which to code the text analysis program, as well as being appropriate, for reasons mentioned earlier, as the language in which to represent knowledge. We could not have done this project without the availability of PROLOG. The power of its control structure, essentially supplying a parser for free, and the attendant modularity of PROLOG programs, far outweighed any disadvantage (vis-a-vis LISP) in manipulating PROLOG statements as data. As the project proceeded, and we added more and more different methods for analyzing text, we found it necessary to make only relatively minor, straightforward changes to the PROLOG statements already existing. The device of lexically indicated procedures proved to be very useful, providing great flexibility as well as power. Using this device it is easy to handle idiomatic, non-standard, or alternative uses of words. Furthermore, the wisdom of relying entirely on lexically indicated procedures for verbs became clear almost immediately. Our approach to parsing combines much of the methodology of augmented transition networks (see [9] for a discussion of the relation of PROLOG definite clause grammar rules to ATN rules) with the flexibility of PARSIFAL-type techniques [7]. Our use of lexically indicated procedures for verbs draws upon some of the same philosophy as does 'word-expert parsing' [10], though we do not extend this to all words or to a true co-routine control structure. One concern is that the association of a number of patterns with each verb, via lexically indicated procedures, may lead to an undisciplined collection of too large a number of facts. We suggest, however, that (i) people in fact require and use this much information in reading and understanding text; and (ii) similarities between portions of the rules comprising the lexically indicated procedures, which are already becoming identifiable, correspond to structure which can be imposed upon this body of information.
Fortunately, the modular nature of a PROLOG program makes it possible to take advantage of such structure as it is perceived, without making wholesale revisions to a large percentage of the program. One possible approach to systematizing the treatment of verbs is to adapt the methods developed by McCord [11]. There are other methodologies that could have been used for this project. We could have accumulated protocols of people learning how to program in BASIC, instead of relying on introspection to ascertain how the text conveyed its information to a reader. It is possible, for instance, that this approach might have led to the use of an entirely different form in which to cast subject-related background knowledge on how to parse numbers. Protocol analysis for such a purpose probably is a good idea. However, it was beyond our resources for this project, and more importantly, at this stage of knowledge we could (and did) proceed to address the task of text analysis in sufficient detail and with adequate precision (for the present) without the results of protocol analysis. Again, our simulation of the analysis of the sample text is intended only to be illustrative, not definitive, and we make no claims that its details are psychologically valid as operations of human cognitive processing. Also, we could have used a different representation for knowledge, or used more than one representation. Semantic nets, or particular implementations of the concept of frames, such as scripts, could make certain interrelations among facts easier for the program to manipulate. We certainly do not mean to rule out the use of such representations in the future. We do claim, however, that representation of knowledge solely as PROLOG rules has appeared to be adequate thus far, and we suspect that it will continue to be adequate for use with many additional techniques for analyzing text. We now discuss some particular insights obtained while working with the particular sections of text we chose. It is striking how much subject-related background knowledge is assumed. The reader must be quite well prepared for exposure to a new topic, and (though he or she may not realize it) the material will not be greatly divergent from concepts already mastered. Even so, it is easy for the author of a textbook to omit explicit mention of crucial 'connective' concepts. An example of such an omission is found when considering Section B of the text. There we learn that "a constant is a number written in the program", a statement we take as emphasizing syntax because of the presence of the word 'written'. When, later on, we need to know how to evaluate a constant, we come to the conclusion that the value of a constant is the value of the number which is the constant. The point is that a tiny deduction is needed here, and we glossed over this issue entirely by supplying rules embodying subject-related background knowledge about the value of a constant rather than a number. To be more correct, at some point the need to evaluate a constant must trigger the synthesis of a new rule based on the fact that "a constant is a number...".
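For concreteness, a hypothetical form which such a synthesized rule might take, following the labelling conventions of Appendix I (the present program does not actually produce it), is:

% Hypothetical: reduce evaluation of a constant to evaluation of the
% number which is the constant.
value([constant,C],V) :- value([number,C],V).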
Such a deduction is simple for a human student, and may well be made without conscious effort. We conjecture, however, that the initial difficulties that people have in writing their first computer programs arise because information filling in gaps of this sort is not explicitly supplied in their instruction. Related to this point is the fact that textbooks are sometimes imprecise or even incorrect on some matters. The combined operations of manual editing and automated text processing as proposed in this paper will reveal such flaws, which presumably should have been discovered and remedied during the editing process preceding publication. Two examples may be cited from the present text. An imprecision or ambiguity is present regarding the use of the unary plus and minus operators. The text (in its original form, not just in the edited version) defines these operators for numbers, but it is silent as to whether they may be used with variables or (parenthesized) expressions. In the context of expressions, '+' and '-' are mentioned only as binary operators. Indeed, the synthesized rules will not evaluate an expression such as '-X' or '+(A-B)', and in failing to do so, accurately represent the knowledge conveyed by the text and no more. An outright error was edited out of the text, or the text analysis process would not have synthesized a working set of rules. Sentence (A.9) originally read "we often refer to the number in a [mailbox] as a variable". This blunder, confusing a variable with its value, is an example of misinformation which human readers (and, someday, computer programs) must cope with while learning from a text. Illustrative examples often provide the information necessary to resolve problems of this nature successfully. Another example of information the reader must synthesize is the analog of the docommand rule we supplied to act as a 'driver' for the rules synthesized by the text analysis process. In the text we adapted, this idea of executing or performing a command after it is parsed is conveyed largely by the presentation of examples. Clearly a more complete and powerful text analysis program will have to incorporate techniques of induction. Another issue is that while the rules synthesized by the text analysis process can parse and evaluate BASIC expressions and LET commands, they cannot compose them. Actually, because of the way PROLOG statements are executed (there is no presumption as to which of a goal's arguments are inputs and which are outputs), one can almost execute the rules to compose a legal LET command. (This fails in practice because the rules are not ordered correctly for 'writing', and because of side-effects which are irreversible, but in theory these shortcomings could be remedied if desired.) However, such a command would be composed in a most unintelligent manner. Absent would be any and all semantic and pragmatic guidance, such as knowledge that variable names must be kept distinct unless it is desired to reference the same 'container' (or a variable is somehow free for reuse).
The point is that knowledge of how to parse and evaluate elements of a computer language does not imply knowledge of how to write code using those elements intelligently. Textbooks on programming are not always conscientious in making this point, which has definite implications for the creation of a successful program for automated text analysis. There are many relevant areas of computational linguistics which we have ignored almost entirely. We mentioned the problem of anaphora resolution in the discussion of the analysis of sentence (A.8). Another is the problem of dealing with the topic or focus of discourse (see, e.g., [12]). It is unproductive to draw every presupposition and contextual implication from every sentence, and those which are explicitly asserted must be managed so that the context remains current and properly limited as the focus of discussion shifts. We have seen examples of this as various concepts were active during the text analysis process. Note, though, that other aspects of context were ignored completely (because we knew they would not be needed in this case, surely not an adequate justification for a general methodology), e.g., the implicit reference in (A.4) to the fact that there are twenty-six letters. We have not as yet made any attempts to deal with synonymy. If a 'container' had been called a 'variable-holder' (or anything else) in one of the sentences that were analyzed, our program could not have made the required association. Since textbooks use synonymy heavily to avoid a repetitious style, this clearly is an issue which should be addressed. Even though we feel that we have made some significant progress and obtained some useful insights, we believe that the most promising next step for our project would be to continue extending our simulation to the rest of the chapter of text, i.e., to progress to a program which could process text covering the subject of BASIC up to the program level. We do not yet have enough examples of the uses of various text analysis methods to make generalizations with confidence or otherwise stand back and synthesize our insights into the text analysis process. Even if we can complete analysis of the entire chapter, it will be necessary to adapt the methodologies used in order to process automatically textbooks whose subject matter has a different character. We are definitely making use of the fact that our sample text conveys knowledge that is essentially procedural. Surely many of the difficulties we will have to overcome, however, will also apply to automated analysis of other kinds of texts, and any solutions that we find for these difficulties will provide insights which will be useful in general for automated text processing. If our program can be extended successfully, it then would be desirable to attempt to augment it by the incorporation of techniques for induction, thus allowing the analysis of textual material which includes illustrative examples. Such an extension would be decidedly non-trivial.
Appendix I

The PROLOG rules synthesized by the text analysis process are reproduced below, with the exception of asserted concepts and updates of the lexicon. The syntax used is that of the Edinburgh implementation of PROLOG [3]. In particular, variables begin with a capital letter. Therefore we have used atoms such as 'bASIC', to avoid having to quote atoms to make them constants. We assume familiarity on the reader's part with this version of PROLOG. All synthesized rules, unless noted otherwise, are asserted using asserta, ensuring that the most recently synthesized rule will be tried first at execution time.

(A.1)-(A.3) No rules other than asserted concepts were synthesized as a result of analyzing these three sentences.

(A.4) computer_container_name(bASIC,[computer_container_name,V]) -->
    [C1],{letter(C1),character1(C1),name(V,[C1])}.
This definite clause grammar rule parses the one-letter form of a 'computer_container_name'. Note how it 'labels' its output. The unit clause below is the companion assertion which is needed by our labelling convention, as discussed in the body of the paper.

computer_container_name(bASIC,[computer_container_name,V]).
(A.5) character1(C).
computer_container_name(bASIC,[computer_container_name,V]) -->
    [C1,C2],{character2(C1),character3(C2),name(V,[C1,C2])}.
The companion assertion to this parsing rule has already been synthesized.

(A.6) character2(C) :- letter(C).
(A.7) character3(C) :- digit(C).
(A.8) denotation(X,Y) :- computer_container_name(bASIC,X),
    number_in_computer_container(bASIC,X,Y).
This rule says that if X is the name of a 'computer_container', then X denotes the number in that 'computer_container'. The intended interpretation of the term 'number_in_computer_container(C,X,Y)' is "in the context of C, Y is the number in the 'computer_container' named by X". This rule is not invoked when the synthesized program is executed; it figures in the analysis of (A.9).

(A.9) variable(V) :- computer_container_name(V).
This rule is asserted 'just in case', and will never be invoked.
variable(bASIC,[variable,V]) -->
    computer_container_name(bASIC,[computer_container_name,V]).
variable(bASIC,[variable,V]).
The preceding two clauses are the parsing rule and its companion assertion.

denotation(X,Y) :- variable(bASIC,X),
    number_in_computer_container(bASIC,X,Y).
This is the denotation 'propagated from' (A.8). This rule will not be invoked either; it figures in the analysis of (D.2).

(A.10) value(X,Y) :- variable(bASIC,X),
    number_in_computer_container(bASIC,X,Y).
variable_value(bASIC,[variable_value,VV]) -->
    number_in_computer_container(bASIC,[number_in_computer_container,VV]).
variable_value(bASIC,[variable_value,VV]).

The above parsing rule and its companion assertion are synthesized 'just in case', and will never be invoked.

(B.1) constant(C) :- number(C).
This rule is unnecessarily asserted 'just in case'.

constant(bASIC,[constant,C]) --> number(bASIC,[number,C]).
constant(bASIC,[constant,C]).
Compare with the rules asserted as a result of analyzing sentence (A.9).

(B.2) legalchar(number,bASIC,C,CL,NCL) :- decimal_point(C),character4(C),
    legalchara(decimal_point,CL,NCL).
initparse(number,bASIC,[[decimal_point,[0,1,0]],[digit,[0,131071,1]]]).
These two rules require some explanation. They interface with rules expressing subject-related background knowledge about numbers. The top-level rule for parsing numbers involves the subgoals initparse to initialize the process, legalinitchars, legalchars, and legalfinalchars to accept legal initial, intermediate, and final characters respectively, and further subgoals to finish the parsing process. (Legalchars, in turn, has the subgoal legalchar.) The process uses an 'array' associating with each legal intermediate character a triple indicating (i) the number of occurrences of that character (thus far) in the number being parsed; (ii) the maximum number of occurrences allowed; and (iii) the minimum number of occurrences allowed. The largest integer representable in PROLOG is used to represent an 'unbounded' maximum. Legalchara counts occurrences and checks that the maximum is not exceeded (CL and NCL are the old and new 'arrays', respectively).
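Since the definition of legalchara belongs to the background knowledge and is not reproduced in the paper, the following hypothetical reconstruction may help; it increments the occurrence count for the given character class in the 'array' and fails if the maximum would be exceeded.

% Assumed sketch; the actual background rules may differ in detail.
legalchara(CLASS,[[CLASS,[N,MAX,MIN]]|L],[[CLASS,[N1,MAX,MIN]]|L]) :-
    !, N < MAX, N1 is N + 1.
legalchara(CLASS,[P|L],[P|NL]) :- legalchara(CLASS,L,NL).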
(B.3) character4(C).
legalchar(number,bASIC,C,CL,NCL) :- comma(C),character5(C),
    legalchara(comma,CL,NCL).
initparse(number,bASIC,[[comma,[0,0,0]],[decimal_point,[0,1,0]],
    [digit,[0,131071,1]]]).

Note the maximum of zero occurrences of commas allowed in (BASIC) numbers. Each new initparse assertion replaces the preceding one.
(B.4) character5(C).
legalinitchars(number,bASIC,[C]) --> [C],{minus_sign(C),character6(C)},!.
As noted in the text, there is no need for array elements expressing constraints for initial characters.
(B.5) character6(C).
legalinitchars(number,bASIC,[C]) --> [C],{plus_sign(C),character7(C)},!.
(End of Section B). character7(C).
The processing which occurs at the end of a section of text causes the assertion of rules such as this one, needed to complete 'unfinished business'.

(C.1) expression(bASIC,[expression,..E]) --> operand(bASIC,O),
    (blanks,operator(bASIC,OP),blanks,
     expression(bASIC,[expression,..E1]),
     {E = [O,OP,..E1]};{E = [O]}).
operand(bASIC,O) --> "(",blanks,expression(bASIC,O),blanks,")".
(C.2) operand(bASIC,O) --> variable(bASIC,O).

(C.3) operand(bASIC,O) --> constant(bASIC,O).
(C.4) operator(bASIC,[operator,OP]) --> operator8(bASIC,[operator8,OP]);
    operator9(bASIC,[operator9,OP]);operator10(bASIC,[operator10,OP]);
    operator11(bASIC,[operator11,OP]);operator12(bASIC,[operator12,OP]).
operator(bASIC,[operator,OP]).
operator8(bASIC,[operator8,OP]) --> "+",{name(OP,"+")}.
operator8(bASIC,[operator8,OP]).
operator9(bASIC,[operator9,OP]) --> "-",{name(OP,"-")}.
operator9(bASIC,[operator9,OP]).
operator10(bASIC,[operator10,OP]) --> "*",{name(OP,"*")}.
operator10(bASIC,[operator10,OP]).
operator11(bASIC,[operator11,OP]) --> "/",{name(OP,"/")}.
operator11(bASIC,[operator11,OP]).
operator12(bASIC,[operator12,OP]) --> "**",{name(OP,"**")}.
operator12(bASIC,[operator12,OP]).
(C.5) operator12(bASIC,[operator12,OP]) --> [C],{circumflex(C),name(OP,[C])}.
The companion assertion to this parsing rule has already been synthesized.
(D.1) command(bASIC,[lET,V,E]) --> "LET",blanks,variable(bASIC,V),blanks,"=",
    blanks,expression(bASIC,E).
(D.2) execute(bASIC,[lET,V,E]) :- value(E,EV),
    storetract(number_in_computer_container,bASIC,V,_),
    storeassert(number_in_computer_container,bASIC,V,EV,a).
Recall that storetract and storeassert are locally defined extensions to the built-in procedures of the PROLOG language. The statements which define them are not considered to be rules embodying prerequisite knowledge.
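The paper does not reproduce those defining statements; a plausible reconstruction (an assumption, not the author's actual code) uses the 'univ' operator =.. to build the term, with the final argument of storeassert selecting asserta:

% Assumed sketch of the locally defined extensions, covering only the
% forms actually used by the synthesized rules.
storeassert(P,C,X,Y,a) :- T =.. [P,C,X,Y], asserta(T).
storetract(P,C,X,Y) :- T =.. [P,C,X,Y], retractall(T).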
Appendix II

The text analysis program is too large to reproduce in its entirety here; we present in this appendix only those (PROLOG) procedures which were executed in the course of processing sentence (A.8). These procedures constitute about 20-25 per cent of the text analysis program. We hope that they are sufficient to illustrate for the interested reader how our techniques are implemented using PROLOG.
The goal corresponding to (A.8) is:

?- sentence([we,often,use,the,name,of,a,container,to,indicate,the,number,in,it],P).

PROLOG's (sole) solution of the goal is:
P = denotes([computer_container_name,[class,name],[symbrep,yes],[num,1],
    [art,def],[case,obj],[ppmod,[of,[container,[[card,1],[num,1],
    [art,indef],[case,obj]]]]]],[number_in_computer_container,
    [class,number],[symbrep,yes],[num,1],[art,def],[case,obj],
    [ppmod,[in,[container,[[per,3],[num,1],[case,obj]]]]]]).
This is the 'summary' of the sentence in the form of a PROLOG term; it has the structure

denotes([computer_container_name,..FL1],[number_in_computer_container,..FL2]).
A side effect of the analysis of this sentence is the assertion of the rule (explained in Appendix I):

denotation(X,Y) :- computer_container_name(bASIC,X),
    number_in_computer_container(bASIC,X,Y).
The lexical entries relevant to (A.8) are:

l1we([[pro,[num,2],[per,1],[case,subj]]]).
l1often([[adv,[verbmod,yes]]]).
l1use([[verb,[proc,verbuse],[num,2]]]).
l1the([[art,[art,def],[num,_]]]).
l1name([[noun,[num,1],[symbrep,yes]],[verb,[proc,verbnull],[num,2]]]).
l1of([[prep]]).
l1a([[art,[art,indef],[num,1],[card,1]]]).
l1container([[noun,[num,1]]]).
l1to([[prep]]).
l1indicate([[verb,[proc,verbindicate],[num,2]]]).
l1number([[noun,[proc,quantnoun],[num,1],[symbrep,yes]]]).
l1in([[prep]]).
l1it([[pro,[num,1],[per,3],[case,_]]]).

The lexically indicated procedures (and some of their specific sub-procedures) are:

verbuse(VERB,[NP,FL,FLV,P],VFL) --> {mergefeatl(VFL,FLV,_),genoun(NP,FL)},
    np(OBJ,[[case,obj]],NFL),[[to|_]],lookup(verb,
    [NP,FL,[[instr,[OBJ,NFL]]],P],[[not,[tns,_]],[not,[part,_]]],fail).
*verbuse(VERB,[NP,FL,FLV,is_a([NP1C|NP1FLC],[NP2C|NP2FLC])],VFL) -->
    {mergefeatl(VFL,FLV,NFLV)},({member([voice,pass],NFLV),NP1 = NP,
    NP1FL = FL};np(NP1,[[case,obj]],NP1FL)),[[as|_]],
    np1(NP2,[[case,obj]],NP2FL),{getconcept(NP1,NP1FL,NP1C,NP1FLC),
    verbusea(NP1C,NP1FLC,NP2,NP2FL,NP2C,NP2FLC),mkcurconcept(NP2C,NP2FLC)}.
(The second rule of the above procedure was not executed successfully during the analysis of (A.8). All such 'unnecessary' rules of the procedures presented in this appendix are marked with an initial '*'.)

verbindicate(VERB,[NP,FL,FLV,denotes([IOBJC|IOBJCFL],[OBJC|OBJCFL])],VFL) -->
    {mergefeatl(VFL,FLV,NFLV),genoun(NP,FL)},np(OBJ,[[case,obj]],OBJFL),
    ({member([clausemod,uses([COBJ|COBJFL],[IOBJ|IOBJFL])],OBJFL),FLAG = a},!;
     {member([ppmod,[by,[IOBJ,IOBJFL]]],OBJFL),FLAG = a},!;
     {member([instr,[IOBJ,IOBJFL]],NFLV),FLAG = b}),
    {member([symbrep,yes],IOBJFL),
     verbindicatea(FLAG,OBJ,OBJFL,IOBJ,IOBJFL,OBJC,OBJCFL,IOBJC,IOBJCFL)}.
*verbindicatea(a,OBJ,OBJFL,NP,FL,OBJC,OBJCFL,NPC,FLC) :-
    getconcept(OBJ,OBJFL,OBJC,OBJCFL),getconcept(NP,FL,NPC,FLC),
    verbdenotea(NPC,FLC,OBJC,OBJCFL),mkcurconcept(NPC,FLC).
verbindicatea(b,OBJ,OBJFL,NP,FL,OBJC,OBJCFL,NPC,FLC) :-
    getconcept(NP,FL,NPC,FLC),getconcept(OBJ,OBJFL,OBJC,OBJCFL),
    verbdenotea(NPC,FLC,OBJC,OBJCFL),mkcurconcept(OBJC,OBJCFL).
verbdenotea(N1,FL1,N2,FL2) :- bigcontext(C),T1 =.. [N1,C,V1],
    T2 =.. [N2,C,V1,V2],loudassert((denotation(V1,V2) :- ','(T1,T2)),a).
*quantnoun(QNOUN,[NP,IFL,OFL],NFL) --> [[of|_]],
    {delfeats([[num,N],[art,A],[mod,_],[card,_],[case,CASE]],IFL,FL1),
     qnouna(A,NFL)},np1(NP,[[num,2],[case,obj]|FL1],FL2),
    {delfeat([per,_],FL2,FL3),overridefeat([case,CASE],FL3,OFL)}.

(The above rule is the one discussed in the last paragraph of Section 3 of this paper.)
The analysis of (A.8) is dependent upon the presence of three concepts asserted when analyzing sentences (A.1)-(A.3).
concept(container,computer_container,[[use,atom],
    [desc,[possesses,[computer,container]]]]).
concept(number,number_in_computer_container,[[use,atom],
    [desc,[in,[computer_container,number]]]]).
concept(name,computer_container_name,[[use,atom],
    [desc,[possesses,[computer_container,name]]]]).

The top-level analysis procedure is the following.

sentence(L,P) :- lexlook(L,DL),sent1(P,DL,[ ]).

The rest of the procedures used are presented below.
Lexical lookup procedures:

lexlook([ ],[ ]).
lexlook([W|L],[[W|D]|M]) :- atomic(W),!,name(W,L1),lex(L1,D),lexlook(L,M).
*lexlook([W|L],[[W]|M]) :- lexlook(L,M).
lex(L,D) :- lex1(L,D),!;reverse(L,L1),lexa(L1,D),!;D = [ ].
lex1(L,D) :- name(W,[108,49|L]),Y =.. [W,D],Y.
Parsing procedures:

sent1(P) --> np(NP,[[case,subj]],FL),vp(NP,FL,P).
*sent1(P) --> [[there|_]],lookup(cop,[P],[ ],therecop).
np(NP,IFL,OFL) --> lookup(quant,[NP,IFL,OFL],[ ],npq);np1(NP,IFL,OFL).
np1(NP,IFL,OFL) --> preadj(NP,IFL,OFL);np2(NP,IFL,OFL);
    lookup(pro,[NP,IFL,OFL],[ ],npro).
preadj(NP,IFL,OFL) --> lookup(art,[NP,IFL,OFL],[ ],npart),!;
    lookup(dem,[NP,IFL,OFL],[ ],npart),!;lookup(poss,[NP,IFL,OFL],[ ],npart).
npart(PREART,[NP,IFL,OFL],NFL) --> {mergefeatl(NFL,IFL,TFL)},
    np2(NP,TFL,OFL).
np2(NP,IFL,OFL) --> gadj(NP,IFL,OFL);np3(NP,IFL,OFL).
np3(NP,IFL,OFL) --> lookup(noun,[NP,IFL,OFL],[ ],np3a).
*np3a(NOUN,[NN,IFL,OFL],NFL) --> {member([num,1],NFL)},npG(N,IFL,FL1),
    {mkcompoundnoun(NOUN,N,NN,FL1,OFL)}.
np3a(NOUN,[NOUN,IFL,OFL],NFL) --> {mergefeatl(NFL,IFL,FL1)},
    npostmod(NOUN,FL1,OFL).
npostmod(N,FL,OFL) --> lookup(prep,[FL,OFL],[ ],pp),!;
    lookup(adv,[FL,OFL],[ ],npostadvpp),!.
*npostmod(N,FL,[[clausemod,S]|FL]) --> {asserta(mkcurconcept(_,_))},
    (lookup(verb,[X,[ ],[ ],S],[[part,past]],fail);
     lookup(verb,[N,FL,[ ],S],[[part,pres]],fail);
     retractfail(mkcurconcept(_,_))),!,{retract(mkcurconcept(_,_))}.
npostmod(N,FL,FL,L,L).
*retractfail(C,_,_) :- retract(C),!,fail.
pp(PREP,[IFL,OFL],PFL) --> np(NP,[[case,obj]],FL),
    {append(IFL,[[ppmod,[PREP,[NP,FL]]]],OFL)}.
npro(PRO,[_,IFL,OFL],FL) --> {mergefeatl(FL,IFL,OFL)}.
vp(NP,FL,P) --> {verbfeats(FL,FLV)},vpa(NP,FL,FLV,P).
vpa(NP,FL,FLV,P) --> lookup(adv,[NP,FL,FLV,P],[ ],vpadv);
    lookup(verb,[NP,FL,FLV,P],[ ],fail);lookup(cop,[NP,FL,FLV,P],[ ],coproc);
    lookup(aux,[NP,FL,FLV,P],[ ],auxproc).
vpadv(ADV,[NP,FL,FLV,P],AFL) --> {member([verbmod,yes],AFL)},
    vpa(NP,FL,[[adv,ADV]|FLV],P).
Utility procedures:

lookup(CAT,L,REQFL,DEFAULTPROC) --> [[W|DL]],
    {df(CAT,DL,FL),verifyfeatl(REQFL,FL)},lookupa(CAT,L,W,FL,DEFAULTPROC).
df(CAT,[[CAT|FL]|L],FL).
*df(CAT,[_|L1],FL) :- df(CAT,L1,FL).
verifyfeatl([ ],_).
verifyfeatl([[not,P]|L],FL) :- member(P,FL),!,fail;!,verifyfeatl(L,FL).
*verifyfeatl([P|L],FL) :- member(P,FL),verifyfeatl(L,FL).
lookupa(CAT,L,W,IFL,DEFPROC) --> {member([proc,PROC],IFL)},!,
    {delfeat([proc,PROC],IFL,FL)},(useproc(PROC,W,L,FL),!;
    {member(CAT,[verb,cop,aux])},advpostmod(L,W,FL,FL1),
    (useproc(PROC,W,L,FL1),!;useproc(DEFPROC,W,L,FL1),!);
    useproc(DEFPROC,W,L,FL)).
lookupa(CAT,L,W,FL,DEFPROC) --> {member(CAT,[verb,cop])},
    advpostmod(L,W,FL,FL1),useproc(DEFPROC,W,L,FL1),!;
    useproc(DEFPROC,W,L,FL).
useproc(PROC,W,L,FL,R,RR) :- Y =.. [PROC,W,L,FL,R,RR],Y.
*mergefeatl(L,[ ],L) :- !.
mergefeatl([P|L],IFL,OFL) :- addfeat(P,IFL,TFL),mergefeatl(L,TFL,OFL).
mergefeatl([ ],FL,FL).
addfeat([N,V],FL,FL) :- member([N,X],FL),!,V = X.
addfeat(PAIR,FL,[PAIR|FL]).
delfeats(L,[ ],[ ]) :- !.
*delfeats([ ],FL,FL).
delfeats([P|L],IFL,OFL) :- delfeat(P,IFL,TFL),delfeats(L,TFL,OFL).
delfeat([N,V],[[N,V]|L],L) :- !.
delfeat(PAIR,[P|L],[P|NL]) :- delfeat(PAIR,L,NL).
*delfeat(_,[ ],[ ]).
verbfeats([ ],[ ]).
verbfeats([[A,V]|L],[[A,V]|NL]) :- member(A,[num,per,tns,voice,part]),!,
    verbfeats(L,NL).
verbfeats([P|L],NL) :- verbfeats(L,NL).
genoun(V,FL) :- var(V);member([specif,no],FL).
*getconcept(V,FL,V,FL1) :- var(V),!,write('Unresolved antecedent.'),nl,fail;
    member([class,_],FL),FL1 = FL.
getconcept(NP,FL,NPC,CFL) :- getroot(NP,FL,R),getconcepta(NP,FL,NPC,CFL,R).
getconcepta(N,FL,CN,[[class,R]|NFL],R) :- concept(R,CN,CFL),
    cmergefeatl(CFL,FL,NFL),!.
*getconcepta(N,FL,CN,[[class,R]|FL],R) :- member([ppmod,[P,[OBJ,OFL]]],FL),
    getroot(OBJ,OFL,OBJR),mkppconcept(P,N,R,OBJR,CN),!.
*getconcepta(N,FL,R,FL,R).
*getroot(NP,FL,R) :- member([root,R],FL),!.
getroot(N,FL,N).
cmergefeatl(CFL,FL,NFL) :- descheck(CFL,FL,NCFL),mergefeatl(NCFL,FL,NFL).
descheck(CFL,FL,NCFL) :- delfeats([[desc,P],[use,_]],CFL,FL1),
    deschecka(P,FL,FL1,NCFL).
*deschecka(P,FL,FL1,FL1) :- var(P),!.
deschecka([possesses,[NPC,Y]],FL,FL1,NCFL) :-
    member([ppmod,[PREP,[NP,NFL]]],FL),!,NCFL = FL1,(PREP = of,!,
    getconcept(NP,NFL,NPC,CFL);!);NCFL = [[ppmod,[of,[NPC,[ ]]]]|FL1].
deschecka([in,[NPC,Y]],FL,FL1,NCFL) :- member([ppmod,[PREP,[NP,NFL]]],FL),
    !,NCFL = FL1,(PREP = in,!,getconcept(NP,NFL,NPC,CFL);!);
    NCFL = [[ppmod,[in,[NPC,[ ]]]]|FL1].
loudassert(TERM,F) :- (alreadyexists(TERM),!,write('Pre-existing assertion: ');
    lassert(TERM,F),write('Asserting: ')),display(TERM),nl.
lassert(TERM,a) :- asserta(TERM).
*lassert(TERM,z) :- assertz(TERM).
mkcurconcept(NP,FL) :- member([class,_],FL),!;
    sentno(N),loudassert(concept(NP,NP,[[use,atom],[sentno,N]|FL]),a).
bigcontext('bASIC').
member(X,[X|_]) :- !.
member(X,[_|Z]) :- member(X,Z).
append([ ],X,X).
append([H|X],Y,[H|Z]) :- append(X,Y,Z).
reverse(X,Y) :- rev1(X,[ ],Y).
rev1([ ],X,X).
rev1([X|Y],Z,W) :- rev1(Y,[X|Z],W).

Appendix III

To indicate the nature and extent of the editing that was performed on the original text, we reproduce below the unedited paragraphs from [6] which, after editing, became text Sections A and B as given in the body of the paper. Note that the original text given below contains the erroneous information which we had to correct when formulating our sentence (A.9), as mentioned in Section 6 of the paper.

A. The computer is provided with a number of 'mailboxes', each of which can hold a number. Each mailbox has a name. There are twenty-six mailboxes with simple one-letter names: A, B, C, ..., Z. The remaining mailboxes have two-character names--a letter followed by a digit: A0, A1, ..., A9, B0, ..., B9, C0, ..., Z9. For convenience, we often use the name of a mailbox to indicate the number in it. And since the number in a mailbox may be taken out and a new one put in its place we often refer to the number in a mailbox as a variable (since it may vary as the program is executed). Thus variable A means the number currently in mailbox A, variable B3 means the number currently in mailbox B3.
B. A constant is simply a number written in the program. The rules for writing numbers apply to both constants and numbers included as data:
1. A decimal point may or may not be included.
2. A minus number is indicated by preceding the number with a minus sign.
3. A positive number need not be preceded by a plus sign.
4. Commas may not be included.

REFERENCES

1. Lederberg, J. (Quoted in Freiherr, G., The Seeds of Artificial Intelligence (Division of Research Resources, National Institutes of Health, Bethesda, MD, 1980) 67).
2. Larkin, J., McDermott, J., Simon, D.P. and Simon, H.A., Expert and novice performance in solving physics problems, Science 208 (1980) 1335-1342.
3. Pereira, L.M., Pereira, F.C.N. and Warren, D.H.D., User's Guide to DECsystem-10 PROLOG (University of Edinburgh, Edinburgh, 1978).
4. Kowalski, R.A., Logic for Problem Solving (North-Holland, Amsterdam, 1979).
5. Silva, G. and Dwiggins, D., Toward a PROLOG text grammar, SIGART Newsletter 73 (1980) 20-25.
6. Sharpe, W.F., BASIC--An Introduction to Computer Programming Using the BASIC Language (Free Press, New York, 1967).
7. Marcus, M.P., A Theory of Syntactic Recognition for Natural Language (MIT Press, Cambridge, MA, 1980).
8. Webber, B.L., A Formal Approach to Discourse Anaphora (Garland, New York, 1979).
9. Pereira, F.C.N. and Warren, D.H.D., Definite clause grammars for language analysis--a survey of the formalism and a comparison with augmented transition networks, Artificial Intelligence 13 (1980) 231-278.
10. Rieger, C. and Small, S., Toward a theory of distributed word expert natural language parsing, IEEE Trans. Systems Man Cybernet. 11 (1981) 43-51.
11. McCord, M.C., Using slots and modifiers in logic grammars for natural language, Tech. Rept. No. 69A-80, University of Kentucky, 1980.
12. Grosz, B.J., The representation and use of focus in a system for understanding dialogs, Proceedings of the 5th IJCAI, Cambridge, MA, 1977.
Received January 1982; revised version received September 1982