Comput. Lang. Vol. 19, No. 4, pp. 247-259, 1993 Printed in Great Britain. All rights reserved
0096-0551/93 $6.00 + 0.00 Copyright © 1993 Pergamon Press Ltd
ERROR HANDLING IN A PARALLEL LR SUBSTRING PARSER
GWEN CLARKE and DAVID T. BARNARD†
Computing and Information Science Department, Queen's University, Kingston, Ontario, Canada K7L 3N6
(Received 18 May 1992; revision received 22 July 1993)

Abstract--Parallel parsing is currently receiving attention but there is little discussion about the adaptation of sequential error handling techniques to these parallel algorithms. We describe a noncorrecting error handler implemented with a parallel LR substring parser. The parser used is a parallel version of Cormack's LR substring parser. The applicability of noncorrecting error handling for parallel parsing is discussed. The error information provided for a standard set of 118 erroneous Pascal programs is analysed. The programs are run on the sequential LR substring parser.

LR parsing; bounded context grammars; noncorrecting error handler; error handling; parallel LR substring parsing
1. INTRODUCTION
With the recent development of parallel parsing algorithms (for example, see the thesis by Grafter for some particular examples and a review of related work [1], and the broader survey by Skillicorn and Barnard [2]) there is a need to incorporate error handling techniques. This paper examines the use of a noncorrecting error handler within parallel parsers. This type of error handler is applied in a parallel LR substring parser developed by Clarke and Barnard [3, 4]. The algorithm is a parallel version of Cormack's LR substring parser [5]. Cormack implements the ideas of suffix analysis and substring parsing developed by Richter [6]. Since LR grammars have the correct prefix property, an error is detected where it first occurs. Although the prefix is syntactically correct, it is not necessarily what was intended. For example, in the following Pascal program segment: "Procedure x x : integer;" an error is found when the colon is processed. However, a function rather than a procedure was probably intended. Richter suggests that upon discovering an error, interval analysis should be done; this means that the program is parsed in reverse from the error token until an error is again encountered. Cormack implements interval analysis in his substring parser. The assumption of this paper is that the user wants multiple messages. Then error recovery is necessary to discover multiple syntax errors during a single compilation. Richter's method finds the smallest possible region in which an error is located and it resumes parsing with the token immediately following the error token. The programmer is further assisted by the presentation of the context surrounding the erroneous region and acceptable input at the margins of this region. The method suggested by Richter, and implemented by Cormack, does not attempt to correct or repair errors. 
Error correction must make assumptions about the programmer's intentions; such a task is impossible to perfect and incorrect assumptions may result in spurious error messages. With sequential parsing, cascading errors may result to the right of a repair; in parallel parsing, errors may cascade both to the left and to the right. In a parallel application there is the danger of having to reparse to the left if the repair is done to the left of the detection point. With multiple errors, reparsing may have to occur several times and the advantage of parsing to the right in parallel is diminished.

The successful inclusion of a noncorrecting error handler into the parallel LR substring parser is demonstrated. Also, results are presented from trials run on the sequential LR substring parser with a standard set of erroneous Pascal programs developed by Ripley and Druseikis [7]. This demonstrates the clarity of messages produced by the substring parsing algorithm.

†To whom all correspondence should be addressed.

Overview
Section 2 discusses the background material in the areas of LR substring parsing and error handling. Section 3 discusses the incorporation of a noncorrecting error handler into the parallel LR substring parser. Section 4 discusses the erroneous trials run on the parallel version for an expression grammar. Section 5 discusses the second set of erroneous trials that demonstrate the information produced by the sequential parsing algorithm on a standard set of erroneous Pascal programs. Finally, Section 6 contains the conclusions.
2. BACKGROUND

This paper assumes that the reader is familiar with compilation and parsing. For more information the reader is directed to Ref. [8]. Some knowledge of LR parsing is also assumed.

2.1. Terminology and conventions
Any terminology and conventions not defined here can be found in Ref. [8]. The following conventions are followed:
• T is the finite set of terminal symbols;
• N is the finite set of nonterminal symbols;
• capital letters at the beginning of the alphabet (A, B, C, ...) are nonterminal characters, e.g. $A \in N$;
• capital letters at the end of the alphabet are strings of terminals and nonterminals, e.g. $X \in (N \cup T)^{*}$; and
• small letters at the end of the alphabet are strings of terminals, e.g. $x \in T^{*}$.

A context-free grammar (CFG) is defined as a 4-tuple, $G = (T, N, S, P)$, where S is the start symbol and P is the set of rules, also referred to as productions. Productions are of the form $A \to X$. LHS refers to the left-hand side of a production; RHS refers to the right-hand side. $A \Rightarrow X$ means that A derives X in one step. $A \Rightarrow^{+} X$ means that A derives X in one or more steps, e.g. $A \Rightarrow X_1 \Rightarrow X_2 \Rightarrow \cdots \Rightarrow X$.

L(G) is the language generated by the grammar G: $L(G) = \{x \mid S \Rightarrow^{+} x\}$. x is said to be a sentence of L(G). A sentential form is derivable from S according to the grammar G: $SF(G) = \{X \mid S \Rightarrow^{+} X\}$. The rightmost derivation for $X \Rightarrow^{+} x$ is the sequence $X \Rightarrow X_1 \Rightarrow X_2 \Rightarrow \cdots \Rightarrow x$ where in each step the rightmost nonterminal is replaced using a production, $p \in P$, of the grammar G. If $X = WAz$ and $A \Rightarrow Z$, where W and z may be empty, then $X \Rightarrow WZz$. A CFG is said to be unambiguous if $x \in L(G)$ implies x has a unique rightmost derivation.

For any language L(G), if $A \to XY \in P$, then $A \to YX \in P^{rev}$ for the language $L^{rev}(G^{rev})$. When a grammar class is symmetric, if $G \in G_{class}$, then $G^{rev} \in G_{class}$.
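As a minimal illustration of the reversed grammar described above, the following Python sketch stores a CFG as a 4-tuple and builds the reversed production set by reversing every right-hand side. The `Grammar` type, the `reverse_grammar` helper and the toy expression grammar are assumptions introduced here for illustration only; they are not taken from the paper.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    """A CFG as the 4-tuple (T, N, S, P); the field names are illustrative."""
    terminals: frozenset
    nonterminals: frozenset
    start: str
    productions: tuple  # pairs (lhs, rhs), where rhs is a tuple of symbols


def reverse_grammar(g):
    """Return the grammar whose productions have reversed right-hand sides."""
    reversed_productions = tuple(
        (lhs, tuple(reversed(rhs))) for lhs, rhs in g.productions
    )
    return g._replace(productions=reversed_productions)


# Hypothetical toy grammar: E -> E + T | T,  T -> ( E ) | n
toy = Grammar(
    terminals=frozenset({"+", "n", "(", ")"}),
    nonterminals=frozenset({"E", "T"}),
    start="E",
    productions=(
        ("E", ("E", "+", "T")),
        ("E", ("T",)),
        ("T", ("(", "E", ")")),
        ("T", ("n",)),
    ),
)
toy_rev = reverse_grammar(toy)
```

Reversing twice returns the original grammar, which is one way to sanity-check such a helper.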
LR parsing is a left-to-right, bottom-up strategy that successively reduces input tokens and previously recognized nonterminals until the start symbol is recognized. LR parsing produces the rightmost derivation of a sentence. Grammars are described as being LR(k), where k refers to the number of lookahead symbols. When k = 0 only the current input symbol is needed to choose a parsing action. When k = 1 the current input symbol and the next input symbol must both be known.

BC(1,1) refers to 1-1 bounded context grammars [9, 10]. A 1-1 bounded context grammar has two characteristics: (1) for every rule $A \to X$, if for some sentential form containing aXb, A derives X, then A derives X for all sentential forms containing aXb; and (2) the grammar is LR(1).
SBC(1,1) refers to 1-1 simple bounded context grammars. A 1-1 simple bounded context grammar has two characteristics: (1) for every rule $A \to X$, if for some sentential form containing aX, and for another sentential form containing Xb, A derives X, then A derives X for all sentential forms containing aXb; and (2) the grammar is LR(0). Both BC(1,1) and SBC(1,1) grammars are symmetric.

Sequential compilers assume that when a symbol table lookup is done the compiler's internal table is complete for the current scope. This cannot be assumed during parallel compilation. The Doesn't Know Yet (DKY) problem exists when a table being searched is not yet complete [11].

When referring to input, left is towards the beginning and right towards the end. A processor, x, is said to be left of a processor, y, if it parses tokens corresponding to input nearer the beginning than those parsed by y and the sections parsed by x and y are adjacent. The leftmost processor at any given level in the tree is the processor parsing tokens that correspond to the leftmost section of input. Similarly, the rightmost processor parses the rightmost section.

A distinction is made between the error token, the error detection point, and the error position. The error token is the token at which the error is detected. The error detection point is the position of the error token. The error position is the location where changes are required to correct the error; it may not be determinable. A token is a nonterminal or a terminal symbol.
2.2. Substring LR parsing

Richter [6] introduces the notion of noncorrecting error handling and develops the theory necessary for suffix analysis, substring analysis and interval analysis. Three limitations of these methods are noted by Richter. First, substring parsers are not necessarily deterministic unless the grammar is a bounded context grammar [9, 10]. The second limitation is that a suffix parser does not produce a parse tree that can be used for semantic analysis. Richter suggests that a normal LR parse be done and suffix analysis only initiated upon finding the first error. Hence, if no errors are found an LR parse is completed. A final limitation is that it is possible to miss errors with these strategies; the errors missed are mismatched parentheses. A missing parenthesis is not detected when another error occurs within the scope. It is argued that such situations are not common and are worth overlooking because at least all the errors that are detected are actual errors. Such errors would be found on a subsequent parse after the correction of existing errors.

Cormack [5] implements a substring parser; his implementation is motivated by Richter's paper. The substring parser is constructed in a similar manner to the construction of an LR parser. Adjustments are made to the LR parser in order to accommodate the recognition of substrings. The usual items $A \to X \cdot Y$, where $A \to XY \in P$, are included and also new suffix items of the form $A \to \ldots \cdot Y$. It is these suffix items that give the parser the ability to recognize only a suffix of a sentence of the grammar when the entire sentence is not available. The parser uses the LR parsing automaton. The parser recognizes a suffix language and, hence, it also recognizes a substring language because of the correct prefix property for all LR parsers. There are two constructions developed, one for BC(1,1) grammars, which is referred to as BC-LR(1,1), and the other for SBC(1,1) grammars, which is referred to as SBC-LR(1,1).
Sentences of BC(1,1) grammars can be recognized by an LR(1) automaton while those of SBC(1,1) can be recognized by an LR(0) automaton. Cormack enhances the SBC-LR(1,1) by eliminating unit rules and merging structurally equivalent subparses during construction; a grammar for Pascal is given by Cormack that runs on the enhanced SBC-LR(1,1) parser. The grammars are symmetric and, therefore, the substring parser can produce a right-to-left parse. Interval analysis is implemented by parsing left-to-right until an error is found and then parsing right-to-left starting with the error token until another error is found. Parsing then proceeds from the token immediately to the right of the right margin of the last interval processed. The substring parser is deterministic and, if there are no syntax errors, it produces a parse tree that
can be used as input for semantic analysis. If there are syntax errors the algorithm produces a forest of partial parse trees.

There are many models of parallel machines. We implemented the parallel LR substring parsing algorithm on 7, 15, and 31 node balanced binary tree architectures [3, 4]. The architecture is loosely coupled, multiple instruction multiple data (MIMD). Communication between processors is done synchronously. The parsing algorithm is a parallel version of Cormack's LR substring parsing algorithm. Each leaf processor partially parses a portion of the original input tokens. Each nonleaf processor parses the partially parsed sections received from its left and right children. During parsing a processor maintains subtrees of the sentence's parse tree on its parse stack when a reduction occurs. Each non-root processor passes its parse stack to its parent upon completion of its parse. The root processor prints the resulting parse tree to a file.

Timing gains are evident when parsing with the parallel version. Useful work is done on substrings of a reasonably small size when there are internal scopes. The average number of tokens needed for reductions to occur is grammar dependent. Also, the sharing of the parsing task amongst the various processors is dependent on the shape of the derivation tree. Shallow and fat derivation trees benefit the most from the parallelization of the parsing task.
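The flow of data through the processor tree can be sketched in Python as follows. This is a sequential simulation under stated assumptions, not the authors' MIMD implementation: the helper names, the even slicing policy and the toy partial parser (which only knows the hypothetical reductions n to E and E + E to E) are all introduced here for illustration.

```python
def leaf_count(nodes):
    """Leaves of a complete binary tree with `nodes` processors (7 -> 4, 15 -> 8, 31 -> 16)."""
    return (nodes + 1) // 2


def split_among_leaves(tokens, leaves):
    """Split tokens into `leaves` contiguous slices, left to right, as evenly as possible."""
    base, extra = divmod(len(tokens), leaves)
    slices, start = [], 0
    for i in range(leaves):
        size = base + (1 if i < extra else 0)
        slices.append(tokens[start:start + size])
        start += size
    return slices


def toy_partial_parse(symbols):
    """Assumed stand-in for a substring parser: reduce n -> E and E + E -> E."""
    stack = []
    for sym in symbols:
        stack.append("E" if sym == "n" else sym)
        while stack[-3:] == ["E", "+", "E"]:
            del stack[-3:]
            stack.append("E")
    return stack


def combine_up(slices, parse):
    """Leaves parse their slices; each parent parses its left child's stack followed by its right child's.

    Assumes the number of slices is a power of two, as with leaf_count above.
    """
    level = [parse(s) for s in slices]
    while len(level) > 1:
        level = [parse(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


tokens = "n + n + n + n + n".split()
root_stack = combine_up(split_among_leaves(tokens, leaf_count(7)), toy_partial_parse)
```

In this toy run some leaves can only hold an unreduced fragment such as `+ E` until a parent joins it with its left neighbour, which mirrors the remark that the amount of useful work per slice depends on the grammar and on the shape of the derivation tree.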
2.3. Error handling during sequential parsing

There is a large body of research on error handling during a canonical parse; Refs [12, 13] are comprehensive surveys of this research. In addition, Ref. [13, p. 52] provides a list of attributes of good error recovery and Ref. [14] defines recovery, repair and correction. Recovery is the continuation of parsing after an error is detected. Repair is the alteration of the parse stack or input stream; it is usually done in order to recover from a parsing error. Correction is the repair that produces a parse of the intended program. The paper [15] and theses [16, 17] contain valuable surveys that emphasize LR parsing.

There are several recurring theories in the error handling literature. These include the following: ad hoc methods, panic mode, spelling correction, error productions and syntax-driven repair.

Ad hoc methods refer to error handling within a parser that is hand tuned to deal with the individual types of errors. The process is done separately for each language and unanticipated errors often have disastrous consequences. These methods are not of interest to us.

Panic mode is discussed in Ref. [12]. This method of error recovery discards tokens until a token is found whose treatment is clear and unambiguous. Then the parser is put in the state that will deal with the distinguished token. Panic mode was the first method developed and is still used in some production compilers. This is due to the ease with which it is implemented in any syntax-driven parser. Its greatest fault is that large portions of text may remain unparsed and, therefore, a single compilation identifies only a subset of the existing errors. Cascading errors can result as well. Variations of panic mode have been developed that require somewhat less text to be skipped but the same difficulties remain, albeit to a lesser extent.

Spelling correction is well understood and can be incorporated into most strategies.
It usually involves the correction of keywords and not identifiers, simply because semantic information is needed to correctly identify identifiers. With parallel parsing this is further complicated by the DKY problem [11] and is most easily restricted to keyword correction.

The augmentation of the grammar to include error productions is mostly criticized for being specific to each grammar, dependent on the expertise of the developer, and producing an ambiguous grammar. Errors that are not anticipated are not usually handled well. A more minor difficulty is the increased size of the parsing tables.

Syntax-driven error recovery and repair refers to the process of analysing the tokens surrounding an error and making decisions using some systematic analysis applied to the grammar. The majority of error handling research falls into this category. The advantage of these methods is that they are not tailored to each grammar. Recovery and repair involve the insertion or deletion of tokens or both. Local repair is done at the error detection token and only involves the modification of the input stream, while regional repair is done by analyzing the entire substructure, and global repair analyzes the entire program. Regional and global repair may modify the parse stack and the input stream. Allowing modification only to the input stream will never correct some errors; for example, the use of the keyword "procedure" where "function" was intended.

Local, regional and global repair incorporate a forward or backward move or both in order to analyze the appropriateness of the repair. The backward move is done on the parse stack; when allowing this type of move semantic analysis must be reversed or delayed. The forward move further parses to the right of the error detection point in order to condense the information around the error. This assists in a repair decision by providing a better understanding of the context surrounding the error. The ability to parse from an arbitrary position in the input is needed in order to parse in parallel and, hence, the techniques used for the forward move are of interest for parallel compilation.

Cost analysis is often applied to compare repair options; costs are attached to the insertion and deletion of each token. The behavior of any algorithm using cost analysis can be altered greatly by altering the individual costs. Again, this is dependent on the expertise of the developer.

Pattern matching is one class of syntax-driven repair. This is the process of identifying the most likely sequence of tokens from the existing sequence. The process of deciding on which correct pattern should be used is done using some form of cost analysis.

Sequential error handling often involves repair in order to recover. Repair is a complex process that contains inherent difficulties. Repair strategies can result in cascading errors and all such strategies tend to repair some errors well and others poorly. Most repair strategies depend on errors being somewhat isolated and behavior greatly deteriorates as errors become dense. A high cost is often associated with the repair process and it is considered desirable to avoid additional costs when parsing syntactically correct input.

Table 1. Acceptable repair results with the Pascal sample

Method                                          Acceptable (%)
Dain (Method 1) (1989)                          57
Stirling (a) (1985)                             66
Stirling (b) (1985)                             66
Sippu and Soisalon-Soininen (1983) [29, 30]     67
Pennello and DeRemer (1978)                     70
Wirth (Stirling, 1985)                          72
Boullier and Jourdan (1987)                     75
IBM Pascal/VS (Spenke et al., 1984)             77
Pai and Kieburtz (1978) [26]                    77
Dain (Method 2) (1989)                          79
Bugge (1982) [20]                               79
Spenke et al. (1984)                            91
Mauney and Fischer (1982)                       94
Burke (1983) [22, 23]                           98

A sample of erroneous Pascal programs is provided in Ref. [7].
The reason for the sample is that when analyzing the effectiveness of an error handling technique it is important to understand the types and frequencies of errors. This sample has been widely used by researchers to compare their results with others. It is difficult to combine the results of the relevant papers because the classification of results is subjective and researchers offer different groups of classification. The method of Ref. [17] is used to combine the classification techniques. A repair is considered excellent if it is what a competent programmer might do, acceptable if it is excellent or if no undesirable side effects result, and poor if excessive token deletion or spurious errors result. Naturally, the repair is categorized as poor if the error is not detected. Refs [17-32] all demonstrate results acquired when running the erroneous Pascal sample. Tables 1 and 2 summarize the results given in these papers.

Table 2. Excellent repair results with the Pascal sample

Method                                          Excellent (%)
Sippu and Soisalon-Soininen (1983) [29, 30]     36
Stirling (a) (1985)                             38
Stirling (b) (1985)                             40
Pennello and DeRemer (1978)                     42
Wirth (Stirling, 1985)                          45
Pai and Kieburtz (1978) [26]                    52
Dain (Method 2) (1989)                          55
Dain (Method 1) (1989)                          57
Bugge (1982) [20]                               57
Boullier and Jourdan (1987)                     75
Spenke et al. (1984)                            77
Burke (1983) [22, 23]                           78

Ripley and Druseikis found that most syntax errors are single token errors and that most appear one per sentence. Interestingly, Ripley and Druseikis note that programmers tend to correct their first few reported errors and recompile. There seems to be no faith in any subsequent error reports.

Attention is being directed toward "user friendly" error reporting. Brown [33, 34] discusses misleading error messages and their frequency. Brown stresses that incorrect messages should not appear and that the error detection position should be correctly identified. Ref. [35] also argues that spurious errors must be completely avoided in order not to mislead the programmer and that accurate information about the expected input is in itself useful information. A method for displaying the right margin of an erroneous region and providing a list of expected input at that position is demonstrated. Two types of errors are distinguished. A first order error is one in which there is only one expected value at the error detection point, and a second order error is one in which there is more than one expected value at the error detection point. Repair is only done for first order errors; with second order errors panic mode is used to continue the parse. In both first and second order errors the error detection position is indicated along with the expected tokens. A subset of the results from running the erroneous Pascal sample is provided.

Two user interface features appearing in some compilers are also discussed in Ref. [33]. The first is the integration of editing with syntax analysis.
With clearly defined error detection and the ability to present acceptable input the programmer's task is made easier. Secondly, windowing within a display could allow one window to display the error and the context surrounding the error while another window displays the corresponding section of the user manual. The context could be expanded upon the programmer's request.
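The token-discarding step of panic mode, as described earlier in this section, can be sketched in Python as follows. The function name, the choice of synchronizing tokens and the sample token stream are assumptions made for illustration; the sketch covers only the skipping of input and not the resetting of the parser state that follows it.

```python
def panic_mode_skip(tokens, error_index, sync_tokens):
    """Return the index of the first synchronizing token at or after error_index.

    Returns len(tokens) if no synchronizing token remains, meaning that the
    rest of the input is left unparsed.
    """
    i = error_index
    while i < len(tokens) and tokens[i] not in sync_tokens:
        i += 1
    return i


# Hypothetical statement stream with a stray "*" detected at index 4.
stream = ["x", ":=", "1", "+", "*", "2", ";", "y", ":=", "3", ";"]
resume_at = panic_mode_skip(stream, 4, {";", "end"})
skipped = stream[4:resume_at]
```

Everything in `skipped` goes unexamined, which is the weakness noted above: any further errors inside the discarded text are not reported in the same compilation.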
2.4. Error handling during parallel parsing

Investigation of error handling for parallel parsers is still in its early stages. An error repair strategy for the parallel parsing method presented in Ref. [36] is discussed in Ref. [37]. Two problems are noted. First, errors may cascade in both directions and, second, errors may not be repaired. Repair fails when an error is detected to the left of the sequential detection point. This occurs because of the cancellation messages sent leftward. In its favor, the repair strategy makes use of the ability to begin parsing at an arbitrary point in order to gain right context. This facilitates making a better choice for repair. Also, error intervals would be more difficult to locate for the method of parallel parsing bounded context grammars mentioned by Ref. [37]; reparsing to the right to find the right margin would be necessary. This reparsing could be lengthy as cancellation messages can cascade. This is because, unlike Cormack's substring parser, Schell allows the reduction of partial phrases. When such a reduction occurs a cancellation message is sent to the left.

3. THE ERROR HANDLER FOR THE PARALLEL LR SUBSTRING PARSER

Error repair is not incorporated into the error handling scheme used by the substring parser. The substring parser was initially developed as a means of noncorrecting error handling. As a result the following common pitfalls of error repair are avoided:
• the resulting derivation tree does not correspond to the programmer's intended program;
• cascading errors resulting in spurious error messages;
• the inability to recover from dense errors;
• portions of input left unparsed; and
• high cost.
Parallel parsing further complicates repair. Errors can potentially cascade to the left as well as to the right; when a repair is made it is possibly the input to a processor on the left. If repair is included
within a parallel parse, then reparsing a section already parsed by a processor on the left may be necessary. This is potentially complicated because the section to be reparsed may already have its parse passed along to yet another processor. Avoiding repair in a parallel parser allows processors to work independently and their work is always valid. However, errors cannot be ignored; information concerning errors must be accurately communicated. Error handling during the parallel parsing implementation of the substring parser is accomplished in the same manner as it is in the sequential version. To give as much information as possible to the programmer, interval analysis (the process of parsing in reverse from the error token until an error is again detected) is done in order to identify the left and right margins of the erroneous region. This is necessary because, although the sequential parser has the correct prefix property, the error is not necessarily located at the position of detection. The right margin of an erroneous region is the error token identified during the left to right parse. Once an error is detected parsing commences in reverse beginning with the error token. If this error token is a nonterminal, then the initial input to the inverse parser is the leftmost terminal derived from the nonterminal. All input to the inverse parser is the initial input tokens to be parsed. The inverse parse terminates once an error token is found; the error token is the left margin of the erroneous region. Information about the error is kept on the parse stack in order to pass it up the tree. Then the left to right parse begins again at the token immediately to the right of the right margin in the current input stream. Interval analysis is time consuming and, hence, a further advantage of a parallel parse is that processors can be dealing with different errors simultaneously. Appendix A contains the error handling algorithm. 
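The control flow of interval analysis described above can be sketched in Python under strong simplifying assumptions. The predicate `is_substring` stands in for both the left-to-right substring parser and the inverse parser; here it is a toy oracle that only checks that operands and operators alternate, not an LR automaton. All names are hypothetical and the sketch is not the algorithm of Appendix A.

```python
OPERANDS = {"a", "b", "c"}


def toy_is_substring(seq):
    """Toy oracle (assumption): operands and operators must alternate."""
    kinds = [tok in OPERANDS for tok in seq]
    return all(x != y for x, y in zip(kinds, kinds[1:]))


def interval_analysis(tokens, is_substring):
    """Return (left_margin, right_margin) index pairs, one per erroneous region.

    A left margin of None means the reverse scan reached the start of the input.
    """
    intervals = []
    start, n = 0, len(tokens)
    while start < n:
        # Left-to-right scan: extend while the tokens since `start` stay viable.
        right = start
        while right < n and is_substring(tokens[start:right + 1]):
            right += 1
        if right == n:
            break  # no further error token
        # Reverse scan: begin with the error token and extend leftward.
        left = right
        while left >= 0 and is_substring(tokens[left:right + 1]):
            left -= 1
        intervals.append((left if left >= 0 else None, right))
        # Resume immediately to the right of the right margin.
        start = right + 1
    return intervals


margins = interval_analysis("a + * b + c c".split(), toy_is_substring)
```

With this toy oracle every interval is only two tokens wide; a real substring parser can produce wider intervals because its viability test depends on more than adjacent tokens. The repeated slicing also makes this sketch quadratic, so it says nothing about the running time of the actual parser.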
The time complexity for parsing erroneous programs is O(n). When interval analysis is done during error handling, more space is needed at each processor to store the inverse parse table. This table tends to be much larger than the parse table used to parse left to right, but it is not an unreasonable size. For example, in Pascal there are 1007 states in the left to right parse and 1292 states in the inverse.

A file that contains the sentence to be parsed is input to the parser. The scanning/screening is done during the initialization phase of the parser implementation in order that tokens be available to each processor. In order to implement interval analysis all of the input tokens must be available at each processor. Alternatively, a leaf could read the section it is to parse and processors could read the input up to the error only when they encounter one. The types of sentences for the trials are discussed in Section 4.

In our implementation each processor loads the left to right parse table and the right to left parse table from two separate files. However, the inverse parse tables need only be loaded by a processor if an error is encountered. Whether the inverse tables are loaded during initialization or when an error is found does not affect the parsing results of interest.

Once the root of the tree completes its parse, it contains the entire derivation tree. Each node of the derivation tree contains a production number used in the derivation of the sentence. The root prints a postfix representation of the derivation tree to a file. Leaves and empty positions in the tree are represented by a value of (-1). If there are errors then the partially parsed representation is printed. This consists of the postfix representation of derivation trees separated by the left and right margins of the error occurring between the trees. For each margin the input token is displayed along with the expected input.
Two examples of input and corresponding output are found in Fig. 1. One is a syntactically correct sentence while the other is not.
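A postfix listing of a derivation tree in which missing children are printed as -1 can be sketched in Python as follows. The binary `Node` layout, the production numbers and the `postfix` helper are assumptions made for illustration; the paper does not specify the node structure, so this sketch should not be expected to reproduce the exact output of Fig. 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical derivation-tree node holding a production number."""
    production: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def postfix(node):
    """Post-order listing: left subtree, right subtree, then the production number.

    A missing child (a leaf or an empty position) is listed as -1.
    """
    if node is None:
        return [-1]
    return postfix(node.left) + postfix(node.right) + [node.production]


tree = Node(3, Node(2), Node(5, Node(6)))
listing = postfix(tree)
```

Printing `" ".join(map(str, listing))` gives a single line of numbers in the same general style as the output described above.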
4. ERRONEOUS TEST SAMPLE FOR THE PARALLEL VERSION

The memory for individual processors on the machine we used is too small to run Pascal; the Pascal parse table is too large. Sentences of an expression language are tested instead. The expression language specification is given in Appendix B. Tests are run that demonstrate the ability to carry complete error information up the tree. The clarity and accuracy of error information presented to the programmer is further demonstrated in Section 5.

    (a) Input:  8 + ab * (c + d / f) / 3 - 4
        Output: -1 -1 -1 -1 -1 -1 -1 -1 6 2 8 5 6 2 3 ~
                ACCEPT

    (b) Input:  a+5+(2.3
        Output: { ERROR: resulting state is not a final state }
                { ERROR at ( line 1, token 5, inverse state 19 ) EXPECTED: Shift On: bof + Reduce On: -OR-
                { ERROR at eof line 2, token 1, state 35 ) EXPECTED: Shift On: + - * / ) Reduce On:
                -1 -1 -1 -1 -1 5 -1 -1 -1 -1 -1 2 ~
                REJECT

Fig. 1. Input with corresponding output.

The testing, on the parallel substring parser, of erroneous expression programs is divided into two categories: mismatched parentheses and all other errors. A mismatched parenthesis is the most difficult error to handle. It is impossible to isolate the location in which the missing bracket should be placed. The best that can be done is to find the next scope in the case of a missing "end" or the previous scope when there is a missing "begin". The longest inverse parse occurs when a mismatched parenthesis is discovered. As expected, the information presented at the left and right margins clearly points to a bracketing error but does not locate the correct position for inserting the missing token. As noted by Richter, it is possible to miss scoping errors with a suffix parser. Once recovery from an error is done, the information about opened scopes is lost.

With the language used there are only two other possible errors, an omitted operator or an omitted operand. Both are correctly handled. It is also observed that, at times, more errors are discovered in the parallel implementation due to the slicing of the input tokens into sections. The additional information achieved by parallelism is purely an accidental bonus and completely dependent on the arbitrary division of tokens. An implementation to take advantage of this would be impossible since it would require knowledge about errors in the input at the time of division.

This section contained the erroneous tests run with a small language on the parallel LR substring parser. The following section discusses tests run with a large language on the sequential LR substring parser. This demonstrates the accuracy of the LR substring parser with respect to error detection and the clarity of error information presented.

5. ERRONEOUS TEST SAMPLE FOR THE SEQUENTIAL VERSION

The sequential LR substring parser was developed by Cormack [5] to demonstrate a noncorrecting error handler. To further analyse error recovery without repair the parser is run with a standard set of erroneous programs developed by Ripley and Druseikis [7]. This set of programs has been used to study several error handling methods and will, therefore, allow some comparisons to be made. Also, using an external sample removes any bias in choosing test programs. This section demonstrates on the sequential LR substring parser the following with respect to error handling:
• whether or not all errors are detected;
• if the error interval is identified for each detected error; and
• if the information presented at the left and right margins of a detected erroneous region is a satisfactory indication of the correction necessary.

The inclusion of error handling in the parallel substring parser is discussed in the previous section.
5.1. The test sample

The sample consists of a set of errors found within 237 erroneous programs (approx. 12000 lines of code) submitted by graduate computer science students at the University of Arizona. Of the 2816 error messages, 50.7% were duplicate errors within individual programs. The various types of errors were isolated and small programs were designed for each type of error. The resulting sample contains 126 erroneous Pascal programs. Overall, Ripley and Druseikis found that syntax errors tend to be simple and infrequent. The original 237 programs contained the following frequencies and classes of errors:
• 60% punctuation errors;
• 20% operator and operand errors;
• 15% keyword errors; and
• 5% other types of errors (e.g. statement out of order).

There was not one occurrence of an error where adjacent tokens were transposed. Only 20.8% of the erroneous statements contained more than one error. 20% of all errors occurred at the end of a line of text. Single token errors made up 88% of the errors; the single token errors consisted of the following types:
• 41% were a missing token correctable by inserting a token;
• 39% were an incorrect token correctable by replacing the token; and
• 8% were an extra token correctable by deleting the token.

Ripley and Druseikis provide the correction that they assume for each of the 126 programs; they admit that the categorizing of errors is subjective. Without discussing the intentions with the programmer it is impossible to achieve certainty. Of the 126 programs only 118 are tested; the other 8 have what we consider lexing errors. Within these 118 programs there are 131 errors whose handling is included in our analysis.
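The three single token error types listed above (missing, incorrect, extra) can be told apart mechanically once an assumed correction is available. The following is a minimal sketch, not part of the original study: the function name `classify_single_token_error` and the token lists in the usage note are hypothetical, and the assumed correction must be supplied by hand, as Ripley and Druseikis do for their sample.

```python
def classify_single_token_error(erroneous, intended):
    """Classify the difference between two token lists.

    Returns 'missing' (fixed by inserting one token), 'incorrect' (fixed by
    replacing one token), 'extra' (fixed by deleting one token), or None when
    the two lists do not differ by exactly one single-token edit.
    """
    n, m = len(erroneous), len(intended)
    if abs(n - m) > 1:
        return None

    # Length of the common prefix of the two token lists.
    p = 0
    while p < min(n, m) and erroneous[p] == intended[p]:
        p += 1

    if n == m:
        if p == n:
            return None  # identical lists, no error
        # One token replaced: the remainders after position p must agree.
        return "incorrect" if erroneous[p + 1:] == intended[p + 1:] else None
    if n + 1 == m:
        # One token absent from the erroneous input.
        return "missing" if erroneous[p:] == intended[p + 1:] else None
    # n == m + 1: one surplus token in the erroneous input.
    return "extra" if erroneous[p + 1:] == intended[p:] else None
```

For instance, with hypothetical token lists, `classify_single_token_error(["var", "a", ",", "x", ",", ";"], ["var", "a", ",", "x", ";"])` yields `"extra"`, while two transposed tokens yield `None` because they are not a single-token edit.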
5.2. The categorization of the results

The judging of the information given by the substring parser is somewhat different in nature to judging repair strategies. With repair strategies the effectiveness is usually classified according to whether the error is detected, how close the repair is to a correction, and whether or not cascading errors result from the repair. With this noncorrecting error recovery the results are classified by whether the error interval is correctly identified and whether or not the legal tokens provided at the margins clearly indicate the necessary correction. Both judgement processes require the assumption that a particular correction is what the original programmer intended. Because Ripley and Druseikis provide the correction that they assume for each of the 126 programs, some similarity in judging parsers using this sample is possible. However, when the repair or the margin listings are not the correction, then it is a subjective decision as to how good the repair or listing is.

The categories for error handling for this LR substring parser are as follows:
(1) the left and right margins are correctly identified and the correction needed is clear;
(2) the left and right margins are correctly identified and the correction needed is reasonably easy to ascertain;
(3) more than one error interval is identified for one error but there is a reasonable indication of the correction;
(4) the information is confusing; and
(5) the error is not identified.
When a single token error is reported the correction needed is considered clear in the following situations:
• when there is a missing token and that token is listed as expected input at one or both of the error interval's margins;
• when there is an extra token and the token preceding the extra token is listed as expected input at the left margin; and
• when there is an incorrect token and the correct token is listed as expected input at one or both of the margins.

Section 2.3 describes the method used to combine the various categories used by researchers when compiling results from the erroneous Pascal sample. Similarly, for our results excellent will be all those in category 1, acceptable all those in categories 1-3, and poor all those in categories 4-5. Unfortunately, it is impossible to avoid the subjective nature of choosing categories for the results obtained.
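The clarity rule for single token errors and the grouping of categories into excellent, acceptable and poor can be written down as two small functions. This is only a sketch of the rules as stated above: the names `correction_is_clear` and `ratings`, the argument layout, and the expected-input sets in the usage note are assumptions made for illustration and do not come from the parser itself.

```python
def correction_is_clear(kind, key_token, left_expected, right_expected):
    """Apply the clarity rule for a reported single token error.

    kind           -- 'missing', 'extra' or 'incorrect'
    key_token      -- the missing token, the token preceding the extra token,
                      or the correct token, depending on kind
    left_expected  -- expected input listed at the left margin
    right_expected -- expected input listed at the right margin
    """
    if kind == "extra":
        # Only the left margin listing is consulted for an extra token.
        return key_token in left_expected
    if kind in ("missing", "incorrect"):
        # Either margin (or both) may list the token.
        return key_token in left_expected or key_token in right_expected
    raise ValueError("unknown kind of single token error: %r" % (kind,))


def ratings(category):
    """Map a result category (1-5) to the summary labels it counts toward."""
    if category == 1:
        return {"excellent", "acceptable"}  # category 1 counts in both groups
    if category in (2, 3):
        return {"acceptable"}
    if category in (4, 5):
        return {"poor"}
    raise ValueError("category must be in 1..5")
```

As a usage example with made-up margin listings, `correction_is_clear("incorrect", "function", {"function"}, {";", "("})` is true, and `ratings(1)` returns both labels because acceptable is defined to include category 1.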
5.3. Results of the sequential LR substring parser

Table 3 provides the results from running the sequential LR substring parser with the 118 erroneous Pascal programs. Overall the error information is clear and, coupled with the correct placement of error margins, often assists the programmer in quickly determining the correction needed.

Seven of the errors in category 2 are declarations out of order; the margins are very helpful but the expected values do not indicate the error. Category 2 also contains token replacement errors when the error position is not at the margins. The discovery of two error intervals when one error exists, category 3, happens only in cases where there is more than one token involved within an individual error. Confusion over a particular language's structure causes this type of situation; a Pascal example is the use of "( )" instead of "[ ]" for array subscripting. Category 4 also contains instances where a language construct is misunderstood, except that in this category the misuse involves more tokens or the information is unclear. This category also contains multiple errors that result when comment brackets or quotations are omitted. Category 5 contains only instances of missed errors that occur soon after error recovery. There were no cases of extra or misleading error interval reports.

The results are overly optimistic because multiple errors are limited in the sample. The sample has only a limited number of cases when the suffix of a sentence does not match its prefix. An example of this type of problem in Pascal is "procedure xx (a:integer;):integer;"; the use of "procedure" where "function" is intended is missed because of the ";)" error. Another Pascal example is "f := a*(b+)*d)/e"; the missing left bracket is not detected because of the "+)" error. These missed errors are discovered during a subsequent parse.
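The per-category counts reported in Table 3 can be tallied against the groupings of Section 5.2. The short sketch below does only that arithmetic; the names `TABLE_3_COUNTS` and `summarize` are introduced here for illustration. The five counts sum to the 131 errors mentioned in Section 5.1. Computed directly from the counts, category 1 comes to roughly 77% and categories 1-3 to roughly 95%, so the printed figure for excellent may reflect rounding or a different base; the text does not say.

```python
# Counts per category as printed in Table 3.
TABLE_3_COUNTS = {1: 101, 2: 11, 3: 13, 4: 3, 5: 3}


def summarize(counts):
    """Return the total and the share of errors in each summary group."""
    total = sum(counts.values())
    excellent = counts[1]                            # category 1 only
    acceptable = counts[1] + counts[2] + counts[3]   # categories 1-3
    poor = counts[4] + counts[5]                     # categories 4-5
    return {
        "total": total,
        "excellent": excellent / total,
        "acceptable": acceptable / total,
        "poor": poor / total,
    }


if __name__ == "__main__":
    summary = summarize(TABLE_3_COUNTS)
    print("total errors:", summary["total"])
    for label in ("excellent", "acceptable", "poor"):
        print("%-10s %.1f%%" % (label, 100 * summary[label]))
```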
Table 3. Results from erroneous Pascal programs

Category      Number
1             101
2             11
3             13
4             3
5             3

Excellent     75%
Acceptable    95%

5.4. Comparison to another noncorrecting algorithm

Another noncorrecting method is presented in Ref. [35]. This method identifies the right margin and lists acceptable input. Repair is done if there is just one acceptable input and parsing continues; otherwise, a form of panic mode is used to resume the parse. This noncorrecting algorithm is also run on Ripley and Druseikis' sample. Only the results of 12 programs are provided and there is no overall analysis.

Interval analysis improves the information over the method of Kantorowitz and Laor. In cases where brackets are mismatched interval analysis provides the region in which the bracket must be inserted; an error example of this is "x:=4*x + 5);". With bracketing errors the identification of the left margin is helpful. An example poorly handled by Kantorowitz and Laor and well handled by the LR substring parser is "procedure xx :integer;". Kantorowitz and Laor would repair the ":" with a ";", which would result in spurious errors, and the option of replacing "procedure" with "function" would not be displayed. In cases where the left and right margins of an error interval are adjacent the method of Kantorowitz and Laor is as sufficient as interval analysis; an error example of this is "var a,x,m,n,;".

The use of panic mode to recover by Kantorowitz and Laor allows errors to be missed in cases where they would be flagged by the LR substring parser. An example is "const n = 5: b:=4; var a: integer;"; the ":=" would not be caught by Kantorowitz and Laor's scheme.

Kantorowitz and Laor provide support for noncorrecting error handling. Their examples further demonstrate how useful information identifying the error token and expected input is.

5.5. Comparison to error repair algorithms
The results of the LR substring parser's error handling show two major improvements over the statistics gathered on the repair algorithms. The statistics for the repair strategies that use the same sample are found in Tables 1 and 2. The first major improvement is the large number of errors whose handling can be considered excellent. This is largely due to the determination of the left margin when an error is detected. Most repair strategies perform a local or a regional repair around the error token. The second improvement is the large decrease in the errors whose handling is poor. This is attributed to the avoidance of any form of panic mode within the LR substring parser and, hence, fewer errors are missed. Also, excessive token deletion and spurious errors are avoided. These three factors contribute to the poor handling of errors in repair strategies.

A comparison of the handling of densely occurring errors is not possible; the Pascal sample isolates errors. However, it is expected that LR substring parsing is superior in the handling of dense errors. The LR substring parser would detect most of the errors and be able to continue parsing; unfortunately, all left context is lost in the erroneous section. Repair strategies either terminate the parse when errors are too dense or skip large portions of input to recommence parsing from a stable position. Ripley and Druseikis found that errors are generally isolated.

When possible, a noncorrecting error handler is a viable alternative to repair. In some instances this method of recovery without repair may be considered superior.
6. SUMMARY AND CONCLUSIONS

In addition to the timing gains achieved with the parallel LR substring parser, error recovery is easily accomplished and the information produced is accurate and helpful. With the development of parallel parsing algorithms that contain a left parse and a right parse, error recovery without repair or correction may become the norm.

Parallelizing parsing in itself presents new aspects to error handling. The most obvious advantage is that several errors can be processed simultaneously. There are several advantages of not including error repair when parsing in parallel; no reparsing is necessary and spurious errors are avoided. With an associative operator for parallel parsing, portions of input are not left unparsed. An associative operator enables the parse to begin at any token; hence, the parse may begin immediately following an error. This is in contrast to all methods that apply some version of panic mode recovery.

Once right and left parses are possible then the nature of error handling changes. The right and left bounds of an error region are detectable; we have shown that this information in itself directs the user toward the error.

Acknowledgements--The authors wish to thank Dr G. V. Cormack for the LR substring parsing tables and Dr G. D. Ripley for the erroneous Pascal programs, and the anonymous referee for helpful comments.
REFERENCES

1. Grafter, N. M. Parallel Incremental Compilation. Department of Computer Science, University of Rochester: Ph.D. thesis, Technical Report 349: 113 pages; June 1990.
2. Skillicorn, D. B. and Barnard, D. T. Parallel Compilation: A Status Report. External Technical Report ISSN-0826-022790-267, Queen's University; March 1990.
3. Clarke, G. An LR Substring Parser Applied in a Parallel Environment. Queen's University: Master's thesis; 1991.
4. Clarke, G. and Barnard, D. T. An LR Substring Parser Applied in a Parallel Environment. Unpublished; 1991.
5. Cormack, G. V. An LR substring parser for noncorrecting syntax error recovery. SIGPLAN Notices 24(7): 161-169; July 1989.
6. Richter, H. Noncorrecting syntax error recovery. ACM Trans. Prog. Lang. Syst. 7(3): 478-489; July 1985.
7. Ripley, G. D. and Druseikis, F. C. A statistical analysis of syntax errors. Comput. Lang. 3: 227-240; 1978.
8. Aho, A. V. and Ullman, J. D. The Theory of Parsing, Translation, and Compiling. Englewood Cliffs, N.J.: Prentice-Hall; 1972.
9. Floyd, R. W. Bounded context syntactic analysis. Commun. ACM 7(2); February 1964.
10. Williams, J. H. Bounded context parsable grammars. Information Control 28: 314-334; 1975.
11. Wortman, D. B. A concurrent Modula-2+ compiler. Workshop on Parallel Compilation proceedings; 1990.
12. Barnard, D. T. Syntax Error Handling Techniques. Queen's University: Technical Report 81-125; 1981.
13. Hammond, K. and Rayward-Smith, V. J. A survey on syntactic error recovery and repair. Comput. Lang. 9(1): 51-67; 1984.
14. Horning, J. J. What the Compiler Should Tell the User. In Compiler Construction: An Advanced Course, F. L. Bauer and J. Eickel (eds.); pp. 525-548. Berlin: Springer-Verlag; 1976.
15. Sippu, S. Syntax Error Handling in Compilers. Department of Computer Science, University of Helsinki: Technical Report A-1982-1; 1981.
16. Logothetis, G. On the Automatic Generation of Parsers Providing Locally Least-Cost Repairs. University of Pittsburgh: Master's thesis; 1983.
17. Dain, J. A. Automatic Error Recovery for LR Parsers. University of Warwick: Ph.D. thesis; 1989.
18. Boullier, P. and Jourdan, M. A new error repair recovery scheme for lexical and syntactic analysis. Sci. Comput. Programming 9(3): 271-286; December 1987.
19. Bugge, E. Implementing and Assessing Locally Least-Cost Error Recovery for Pascal. Heriot-Watt University: Master's thesis; 1982.
20. Anderson, S. O., Backhouse, R. C., Bugge, E. H. and Stirling, C. P. An assessment of locally least-cost error recovery. Comput. J. 26: 15-24; 1983.
21. Burke, M. G. A Practical Method for LR and LL Syntactic Error Diagnosis and Recovery. New York University: Ph.D. thesis; February 1983.
22. Burke, M. G. and Fisher, G. A practical method for LR and LL syntactic error diagnosis and recovery. ACM Trans. Prog. Lang. Syst.; April 1987.
23. Burke, M. G. and Fisher, G. A practical method for syntactic error diagnosis and recovery. In Proceedings of the SIGPLAN 82 Symposium on Compiler Construction: 67-78; June 1982.
24. Mauney, J. and Fischer, C. N. A forward move algorithm for LR and LL parsers. Proceedings of the SIGPLAN 82 Symposium on Compiler Construction: 79-87; June 1982.
25. Pai, A. and Kieburtz, R. B. Global context recovery: a new strategy for syntactic error recovery by table-driven parsers. ACM Trans. Prog. Lang. Syst. 2(1): 18-41; 1980.
26. Pai, A. and Kieburtz, R. B. Global context recovery: a new strategy. ACM SIGPLAN Notices 14(8): 158-167; 1979.
27. Pennello, T. M. and DeRemer, F. A forward move algorithm for LR error recovery. Conference Record of the Fifth Annual Symposium on Principles of Programming Languages: 241-254; 1978.
28. Sippu, S. and Soisalon-Soininen, E. A syntax-error-handling technique and its experimental analysis. ACM Trans. Prog. Lang. Syst. 5(4): 656-679; October 1983.
29. Sippu, S. Experiments With an Error Handling Technique. Department of Computer Science, University of Helsinki: Technical Report C-1982-7; 1982.
30. Sippu, S. and Soisalon-Soininen, E. Practical Error Recovery in LR Parsing. Conference Record of the 9th Annual ACM Symposium on Principles of Programming Languages: 177-184; January 1982.
31. Spenke, M., Mühlenbein, H., Mevenkamp, M., Mattern, F. and Beilken, C. A language independent error recovery method for LL(1) parsers. Soft. Pract. Exper. 14(11): 1095-1107; 1984.
32. Stirling, C. Follow set error recovery. Soft. Pract. Exper. 15(3): 239-257; 1985.
33. Brown, P. J. Error messages: the neglected area of the man/machine interface? Commun. ACM 26(4): 246-249; 1983.
34. Brown, P. J. My system gives excellent error messages--or does it? Soft. Pract. Exper. 12: 91-94; 1982.
35. Kantorowitz, E. and Laor, H. Automatic generation of useful syntax error messages. Soft. Pract. Exper.; July 1986.
36. Mickunas, M. D. and Schell, R. M. Parallel Compilation in a Multiprocessor Environment (Extended Abstract). J. ACM 29(2): 241-246; 1978.
37. Schell, Jr., R. M. Methods for Constructing Parallel Compilers for Use in a Multiprocessor Environment. University of Illinois at Urbana-Champaign: Ph.D. thesis; 1979.
APPENDIX A

Error Handler for the Parallel LR Substring Parser

index = index into input of current token
push onto parse stack (current state × (-1), index)    /* RIGHT MARGIN */
return position = index
inverse parse stack = empty
inverse state = inverse start state
while an error is not found
    if inverse parse table (inverse state, tokens[index]) is not negative then    /* SHIFT */
        inverse state = inverse parse table (inverse state, tokens[index])
        push (inverse state, index) onto inverse parse stack
        decrement index
    else if inverse parse table (inverse state, tokens[index]) is not empty then    /* REDUCE */
        rule = inverse parse table (inverse state, tokens[index]) * -1
        pop (size of RHS of rule) items off of the inverse parse stack
        inverse state = state at top of stack
        inverse state = inverse parse table (inverse state, LHS of rule)
        push (inverse state, index) onto inverse parse stack
        decrement index
    else
        ERROR found
push onto parse stack (inverse parse state × (-1), index)    /* LEFT MARGIN */
index = return position
current state = start state
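The backward scan at the heart of the handler can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it covers shift actions only (the reduce branch and the markers pushed on the forward parse stack are omitted), and the function name `find_left_margin`, the dictionary representation of the table, and the toy table `TOY_INVERSE_TABLE` are all hypothetical. A missing dictionary entry stands in for an empty table entry.

```python
def find_left_margin(tokens, error_index, inverse_table, inverse_start_state=0):
    """Scan backward from the error token until the inverse table rejects.

    Returns (index, expected): index is the position at which the backward
    scan stopped (the left margin, or -1 if the start of input was reached),
    and expected lists the tokens the table would have accepted there.
    """
    state = inverse_start_state
    inverse_stack = []
    index = error_index
    while index >= 0:
        entry = inverse_table.get((state, tokens[index]))
        if entry is None:
            break                      # empty entry: ERROR found, stop here
        state = entry                  # SHIFT to the next inverse state
        inverse_stack.append((state, index))
        index -= 1                     # move one token to the left
    expected = sorted(tok for (s, tok) in inverse_table if s == state)
    return index, expected


# Hypothetical toy table: read backward, ": id" may only follow
# "function" or "var". Real inverse tables are generated from the grammar.
TOY_INVERSE_TABLE = {
    (0, ":"): 1,
    (1, "id"): 2,
    (2, "function"): 3,
    (2, "var"): 3,
}

if __name__ == "__main__":
    # Token view of: procedure xx : integer ;  (forward error at the colon)
    tokens = ["procedure", "id", ":", "integer", ";"]
    print(find_left_margin(tokens, 2, TOY_INVERSE_TABLE))
```

With this toy table the scan starting at the colon stops at "procedure" (index 0) and lists "function" and "var" as expected input there, which mirrors the kind of left-margin information discussed in Section 5.4.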
APPENDIX B

Productions for the Expression Grammar

s → bof expression eof
expression → term
expression → expression + term
expression → expression - term
term → factor
term → term * factor
term → term / factor
factor → id
factor → ( expression )
factor → constant
factor → id ** factor
factor → ( expression ) ** factor
factor → constant ** factor

About the Author--GWEN CLARKE is currently a research associate at Queen's University. She is working on the MacDonald project, a program development environment wherein management of programs and supporting documents is integrated into a consistent uniform framework. She recently completed her M.Sc. at Queen's specializing in parallel parsing and error recovery.

About the Author--DAVID T. BARNARD joined the Department of Computing and Information Science at Queen's University in 1977, having studied at the University of Toronto. He is now Professor in that Department and Associate to the Vice-Principal (Resources), in which capacity he is responsible for academic support services and several aspects of human resources policy. His research applies formal language analysis to compiling programming languages with a focus on using parallel machines, and to treating documents as members of a formal language.