Computer Languages, Vol. 6, pp. 95 to 107, 1981
0096-0551/81/020095-13$02.00/0 Copyright © 1981 Pergamon Press Ltd
Printed in Great Britain. All rights reserved
AN LR PARSING TECHNIQUE FOR EXTENDED CONTEXT-FREE GRAMMARS*
AUGUSTO CELENTANO
Istituto di Elettrotecnica ed Elettronica, Politecnico di Milano, Piazza L. da Vinci 32, I-20133 Milano, Italy
(Received 21 October 1980; revision received 18 March 1981)
Abstract: Extended context-free grammars allow regular expressions to appear in the right-hand sides of productions, and are a clear and natural way to describe the syntax of programming languages. This paper presents an LR parsing technique for extended context-free grammars, based on an underlying transformation of the grammar into an equivalent context-free one. The technique is suitable for inclusion in one-pass compilers: its implementation requires only small extensions to the algorithms used for normal LR grammars. Besides describing the parsing method, the paper gives the algorithms for deriving the parsing tables and discusses their optimization. Finally, the technique is compared with other proposals that have appeared in the literature.
LR parsing    Regular expressions    Grammar transformation    Extended context-free grammars    Parser optimization
1. INTRODUCTION
EXTENDED context-free grammars, which use regular expressions (or equivalent notations) in the right-hand sides of productions, are often a clearer and more natural way to describe the syntax of programming languages than pure BNF notation. Considerable attention has been devoted to the problem of constructing parsers for them [1-6]. It is well known that top-down parsers, mainly recursive descent ones, can be derived in a simple way from this kind of syntactic description [7, 8]: the parsing algorithm "walks through" the regular expressions, selecting the actual instances according to the symbols found in the input stream.
Bottom-up parsers, on the contrary, seem to require major changes with respect to the context-free form. The reason lies in their behaviour: input symbols are read and stacked until the right end of the handle is reached, at which point the whole handle is popped and reduced to its corresponding nonterminal symbol. The presence of regular expressions forces the parser to search for the left end of the handle, which can be arbitrarily far from the top of the stack, in order to recognize which instance has been scanned. This can be particularly expensive in LR parsers, where normally no stack examination is required to perform a correct handle reduction.
Since extended context-free grammars generate context-free languages, a first proposed solution was to transform the extended grammar automatically into a context-free one [5, 6]. The drawback of this approach is that the resulting parser is no longer related to the original grammar description, so the designer must be aware of the transformation performed in order to complete the parser with translation routines. Several authors have discussed the problem of extending LR notations and algorithms to deal with extended context-free grammars [2, 5, 9, 10].
Some of their proposals will be discussed later; in general, the suggested extensions focus on reduce actions, which are modified and augmented with operations that examine the stack contents to find the extent of the handle, according to the instances of iterations and alternatives (inside instances of regular expressions) scanned.
* Work supported by Consiglio Nazionale delle Ricerche, Centro Studi per l'Ingegneria dei Sistemi per l'Elaborazione delle Informazioni.
In this paper a different technique is proposed: selections and iterations in instances of regular expressions are recognized (and popped from the stack) during the scan of the handle, and not at the moment of a reduction. This technique seems to violate the constraint of bottom-up parsing that (an instance of) a rule can be recognized only when it has been completely scanned but, as will be shown, it derives from an underlying transformation of the extended context-free grammar into context-free form.
The paper is divided into 9 sections. Section 2 introduces some definitions and the notation used. In Section 3 the technique is illustrated informally with the aid of an example. Sections 4, 5 and 6 are devoted to the construction of the set of states and the parsing tables, and to the parsing algorithm. In Section 7 some optimizations are discussed. A comparison of this technique with those proposed by Madsen and Kristensen [5] and LaLonde [2] precedes the conclusions.
2. DEFINITIONS AND NOTATIONS
The reader is referred to Aho and Ullman [11] for basic definitions of context-free grammars and LR(k) grammars. We use their terminology and notation. Madsen and Kristensen [5] provide definitions for regular expressions and extended context-free grammars to which we adhere. We repeat here the definition of extended context-free grammar: an extended context-free grammar (ECFG) is a 4-tuple G = (VN, VT, P, S) where
(1) VN is a finite set of nonterminal symbols
(2) VT is a finite set of terminal symbols, VN ∩ VT = ∅
(3) P is a finite set of productions of the form A → α, where A ∈ VN and α is a regular expression over VN ∪ VT
(4) S is a distinguished symbol of VN called the axiom.
Given a grammar G = (VN, VT, P, S), it is convenient to consider the augmented grammar G' = (VN ∪ {S'}, VT ∪ {⊣}, P ∪ {S' → S ⊣}, S'), where S' ∉ VN and ⊣ ∉ VT; ⊣ is called the right endmarker, and is used to signal the end of a sentence. In the sequel we shall consider augmented grammars.
The following notation will be used, unless otherwise stated:
(1) words in uppercase letters denote nonterminal symbols
(2) words in lowercase letters, operators and punctuation marks are used as terminal symbols
(3) Greek letters denote strings of, or regular expressions over, terminal and nonterminal symbols
(4) ε denotes the empty string
(5) in regular expressions, the notation {α}* means α repeated zero or more times, and the notation {α1 | α2 | ... | αn} means exactly one of α1, ..., αn.
3. AN INFORMAL DESCRIPTION OF THE METHOD
The technique proposed here is based on an underlying transformation of the extended grammar into context-free form, according to the following rules; the extended productions are shown on the left and the equivalent context-free rules on the right.
(1)  A → α{β}*γ           A → αBγ    B → ε     B → Bβ
(2)  A → α{β1 | β2}γ      A → αBγ    B → β1    B → β2
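The two rules can be applied mechanically. The following Python sketch is only an illustration under stated assumptions: the tuple encoding of regular expressions, the function name `to_context_free` and the generated names `B1`, `B2`, ... are all hypothetical and do not come from the paper. It reuses one new nonterminal for identical regular expressions.

```python
from itertools import count

# Assumed encoding of a regular right-hand side (not from the paper):
#   "x"                     a terminal or nonterminal symbol
#   ("seq", r1, ..., rn)    concatenation; ("seq",) is the empty string
#   ("star", r)             the iteration {r}*
#   ("alt", r1, ..., rn)    the selection {r1 | ... | rn}


def to_context_free(productions):
    """Rewrite extended productions with rules (1) and (2).

    `productions` is a list of (lhs, rhs) pairs. The result is a list of
    (lhs, [symbols]) pairs; an empty symbol list is the empty string.
    """
    rules = []
    replacement = {}          # regular expression → new nonterminal
    numbers = count(1)

    def flatten(expr):
        """Return the plain symbols that stand for `expr` inside a rule."""
        if isinstance(expr, str):
            return [expr]
        kind, *parts = expr
        if kind == "seq":
            return [sym for part in parts for sym in flatten(part)]
        if expr in replacement:            # identical expressions share B
            return [replacement[expr]]
        b = replacement[expr] = f"B{next(numbers)}"
        if kind == "star":                 # rule (1): B → empty, B → B beta
            rules.append((b, []))
            rules.append((b, [b] + flatten(parts[0])))
        elif kind == "alt":                # rule (2): B → beta_i
            for part in parts:
                rules.append((b, flatten(part)))
        else:
            raise ValueError(f"unknown operator {kind!r}")
        return [b]

    for lhs, rhs in productions:
        at = len(rules)                    # keep the rule ahead of its B rules
        rules.insert(at, (lhs, flatten(rhs)))
    return rules


# A small extended grammar: a statement list and an optional else part.
EXAMPLE = [
    ("PROGRAM", ("seq", "begin", "STATS", "end")),
    ("STATS", ("seq", "STAT", ("star", ("seq", ";", "STAT")))),
    ("STAT", ("seq", "if-expr-then", "STATS",
              ("alt", ("seq", "else", "STATS"), ("seq",)), "fi")),
    ("STAT", "other"),
]

for lhs, rhs in to_context_free(EXAMPLE):
    print(lhs, "→", " ".join(rhs) or "ε")
```

The iteration becomes a left-recursive nonterminal and the selection becomes one rule per alternative, so the example yields eight context-free rules.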
It is assumed that each transformation introduces a new nonterminal symbol (denoted by B), unless two symbols would replace identical regular expressions, in which case the same nonterminal is used.
The basic idea of the parsing technique is that the behaviour of the parser for the extended grammar should be related (in a way to be specified later) to that of the parser for the equivalent context-free grammar; that is, it should be possible to establish a correspondence between the parsing actions performed by each parser. We will assume the following grammar G1 as an example:
  (1) PROGRAM → begin STATS end
  (2) STATS → STAT {; STAT}*
  (3) STAT → if-expr-then STATS {else STATS | ε} fi
  (4) STAT → other
An equivalent context-free grammar G2, obtained by systematically applying the transformation rules introduced above, is the following:
  (1) PROGRAM → begin STATS end
  (2) STATS → STAT TAIL
  (3) TAIL → ε
  (4) TAIL → TAIL ; STAT
  (5) STAT → if-expr-then STATS ELSEP fi
  (6) STAT → other
  (7) ELSEP → else STATS
  (8) ELSEP → ε
Suppose we want to parse the string
  begin if-expr-then other ; other fi end
In the sequel P1 will denote the parser for G1 (i.e. the parser for the extended grammar) and P2 will denote the parser for G2 (i.e. the parser for the context-free grammar). The parser P2 starts by stacking the three symbols begin, if-expr-then and other, together with their associated tables; up to this point P1 should behave in the same way. P2 announces that the rule STAT → other has been recognized, and performs a reduction leaving the stack in the following configuration (we drop tables from the stack in this example, since we have not yet shown how to construct those for P1; the top of the stack is the rightmost symbol):
  begin if-expr-then STAT
Since the rule STAT → other appears in G1 too, parser P1 must perform the same action, leaving the stack in the same configuration.
Up to this point regular expressions have not entered into the picture, and the behaviour of both parsers is therefore identical. Upon seeing the symbol ; the parser P2 signals a reduction according to the rule TAIL → ε, pushing the nonterminal symbol TAIL onto the stack. Recalling that the nonterminal symbol TAIL has been introduced in G2 to represent the iteration {; STAT}* of G1, this reduction can be given the following meaning in terms of language constructs: an occurrence (an empty occurrence) of an iteration has been seen; the actual stack configuration is such that further (non-empty) occurrences can be parsed, or the end of the iterative part signalled (depending on the input symbol). Parser P1 should exhibit a similar behaviour, that is, it should announce an (empty) occurrence of an iteration, and provide for parsing further occurrences or signalling the end of the iterative group. We can "simulate" this behaviour by pushing a new symbol onto the stack of P1, standing for "one or more occurrences of the iterative group {; STAT}*", much as the symbol TAIL of G2 does. If we denote such a new symbol by "{; STAT}*", the stack of P1 will be
  begin if-expr-then STAT "{; STAT}*"
Parsing can now proceed by stacking the symbols ; and other, and by both parsers announcing another reduction according to the rule STAT → other. Another occurrence of the iteration has now been completed. P2 reduces according to the rule TAIL → TAIL ; STAT, condensing the information about the previous (empty) and the present (non-empty) repetitions of the iterative group into the single symbol TAIL. Parser P1 should also signal the new occurrence of the iteration; once it is recognized, the parser must be driven into a state such that, again, more occurrences can be parsed, or just their end announced. This information is associated with the (state corresponding to the) new symbol "{; STAT}*", so the actual occurrence of the iteration (i.e. the symbols ; and STAT on the stack) can simply be popped off, leaving the symbol "{; STAT}*" on top of the parsing stack†.
The next input symbol is fi, which ends the repetition of STATs. Both parsers can reduce according to the rules
  STATS → STAT {; STAT}*
  STATS → STAT TAIL
for P1 and P2 respectively. As far as P1 is concerned, no stack examination is necessary to find the left end of the handle, since iterations have already been collapsed. Two items must be popped, as if the rule had only two symbols in its right-hand side (we shall define precisely in Section 6 the notion of equivalent length of a regular expression). The stack configuration is at this point the following one for both parsers:
  begin if-expr-then STATS
Selections can be handled in a similar way. Parser P2 recognizes one of the two alternatives of a selection, reducing according to the rule ELSEP → ε. The alternative just seen is replaced by the nonterminal symbol ELSEP (actually ELSEP is simply pushed, since the right-hand side is empty).
In a similar way P1 can replace the empty alternative by a unique symbol "{else STATS | ε}", leaving the stack in the configuration
  begin if-expr-then STATS "{else STATS | ε}"
The symbol fi is then stacked, and the rule STAT → if ... fi is completed. The reduction can again be performed by P1 without examining the stack, since the selection part, which could have alternatives of different length, is represented by a unique item; the right-hand side is again of known length. The completion of the parse is trivial for both grammars.
Before discussing in detail the parsing algorithm and the table construction algorithms, we point out a number of considerations; a deeper comparison between this and other proposed techniques is deferred to Section 8.
(1) As we noted in the introduction, this technique violates the constraint that a bottom-up parser cannot make any assumption about the production(s) being parsed until the production is completed. The approach taken is consistent with the equivalence criterion assumed between extended grammars and context-free grammars; it is however a matter of more theoretical than practical concern: cases such as the one illustrated by the following example
  S → a {b}* c
  S → a {b}* d
can be parsed by this method if, as assumed, we drop the distinction between b's generated by rule 1 and b's generated by rule 2. If we wanted to keep the two iterations distinct, we would obtain the following context-free grammar
  S → a B1 c | a B2 d
  B1 → ε | B1 b
  B2 → ε | B2 b
which is not LR(k). Other methods [2, 5] can treat these cases, but such a distinction is not likely to be needed in practice.
(2) This parsing technique is suitable for one-pass compilation: the syntax tree according to the extended grammar is not directly derivable, since iterations and selections are not remembered during reductions. The recognition of parts of a rule, however, allows great flexibility in the definition of semantic actions. A drawback of the other cited techniques is that no translation routine can be associated with selections and iterations, since they are announced only at the end of the rule.
(3) Reduce actions (and the popping of iterations or replacing of selections) do not require the stack to be examined, since the length of the handle is known in advance.
(4) The number of states of the extended parser is the same as that of the parser for the equivalent context-free grammar since, in order to "simulate" the behaviour of the latter, additional states are introduced for keeping track of the iterations and selections encountered; a limited space saving derives from the absence of a number of nonterminal symbols in the extended grammar. By virtue of the close relation with normal LR parsing techniques, however, several optimizations can be performed on the tables, which are straightforward extensions of techniques for context-free parsers. Other optimizations, which are peculiar to this approach, will be introduced in Section 7.
† In context-free grammars, reductions according to left recursive rules can be implemented by simply popping n − 1 items from the stack, if n is the length of the rule.
4. CONSTRUCTION OF THE SET OF LR(0) STATES
As in the construction of the set of states for context-free grammars, two fundamental algorithms are involved [11, 12]: the closure of a state, and the construction of the successors of a state. Some additional rules must be considered in order to work with extended grammars; they have been introduced and refined by Earley [10], DeRemer [9], and Madsen and Kristensen [5], and are reported here in a form suitable to our proposal. As a shorthand, items of the form
  A → α1 _ α2 α3
  A → α1 α2 _ α3
appearing in the same state will be merged into the single item
  A → α1 _ α2 _ α3
(1) The following rules must be added to the algorithm for computing the closure of a state:
(a) Any item of the form A → α _ {β}*γ is replaced by the item A → α # {β}*γ
(b) Any item of the form A → α{β _}*γ is replaced by the item A → α{β #}*γ
(c) Any item of the form A → α _ {β1 | β2 | ... | βn}γ is replaced by A → α{_β1 | _β2 | ... | _βn}γ
(d) Any item of the form A → α{β1 | ... | βi _ | ... | βn}γ is replaced by A → α{β1 | ... | βi # | ... | βn}γ
(e) Any item of the form A → α _ is replaced by A → α # (this rule is introduced only for uniformity of notation).
The items containing a symbol # are not considered further during the closure. Following Ref. [5] they are called reduce items; the mark # identifies points in the rule where some special action needs to be performed in order to recognize which instance is actually being parsed.
(2) When constructing the successors of a state and the GOTO function, the following two rules must be obeyed:
(a) If a state Q contains an item A → α # {β}*γ, add the state Q' = CLOSURE(A → α{_β}* _ γ) to the set of states, and let GOTO(Q, U) = Q' for all U ∈ FIRST({β}*γ ∘ FOLLOW(A))†
† FIRSTk({β}*γ) = FIRSTk(γ) ∪ FIRSTk(βγ) ∪ FIRSTk(ββγ) ∪ ... Since we consider FIRST1, only the first two terms need to be taken into account.
State 0:   PROGRAM → _begin STATS end
State 1:   PROGRAM → begin _STATS end
           STATS → _STAT {; STAT}*
           STAT → _if-expr-then STATS {else STATS | ε} fi
           STAT → _other
State 2:   PROGRAM → begin STATS _end
State 3:   STATS → STAT # {; STAT}*
State 4:   STAT → if-expr-then _STATS {else STATS | ε} fi
           STATS → _STAT {; STAT}*
           STAT → _if-expr-then STATS {else STATS | ε} fi
           STAT → _other
State 5:   STAT → other #
State 6:   PROGRAM → begin STATS end #
State 7:   STATS → STAT {_; STAT}* #
State 8:   STAT → if-expr-then STATS {_else STATS | ε #} fi
State 9:   STATS → STAT {; _STAT}*
           STAT → _if-expr-then STATS {else STATS | ε} fi
           STAT → _other
State 10:  STAT → if-expr-then STATS {else _STATS | ε} fi
           STATS → _STAT {; STAT}*
           STAT → _if-expr-then STATS {else STATS | ε} fi
           STAT → _other
State 11:  STATS → STAT {; STAT #}*
State 12:  STAT → if-expr-then STATS {else STATS # | ε} fi
State 13:  STAT → if-expr-then STATS {else STATS | ε} _fi
State 14:  STAT → if-expr-then STATS {else STATS | ε} fi #
Fig. 1. The set of LR(0) states for G1.
(b) If a state Q contains an item A → α{β1 | ... | βi # | ... | βn}γ, add the state Q'' = CLOSURE(A → α{β1 | ... | βn} _ γ) to the set of states, and let GOTO(Q, U) = Q'' for all U ∈ FIRST(γ ∘ FOLLOW(A)).
This leads to a generalization of the GOTO function on terminal symbols, whose meaning can now be expressed as follows: GOTO(Q, U) = Q' if the parser, when it is in state Q, will next go to state Q' upon seeing U in the input, possibly after having made some stack modification (as happens in replace actions, see later), regardless of whether the input symbol is shifted, i.e. removed from the input (it is not in replace and push actions). Fig. 1 shows the set of LR(0) states for grammar G1.
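The lookahead sets FIRST({β}*γ ∘ FOLLOW(A)) and FIRST(γ ∘ FOLLOW(A)) used by rules (2a) and (2b) can be computed directly on regular expressions. The Python sketch below is a minimal illustration for the FIRST1 case. It assumes a hypothetical tuple encoding of regular expressions, and it takes the FIRST sets of the nonterminals and FOLLOW(A) as inputs worked out by hand; the function names are not from the paper.

```python
# Assumed encoding of a regular expression (not from the paper):
#   "x"                     a terminal or nonterminal symbol
#   ("seq", r1, ..., rn)    concatenation; ("seq",) is the empty string
#   ("star", r)             the iteration {r}*
#   ("alt", r1, ..., rn)    the selection {r1 | ... | rn}


def first1(expr, first_nt, nullable_nt=frozenset()):
    """Return (FIRST_1 set, can-derive-empty flag) for a regular expression.

    `first_nt` maps each nonterminal to its FIRST_1 set; symbols that are
    not keys of `first_nt` are treated as terminals.
    """
    if isinstance(expr, str):
        if expr in first_nt:
            return set(first_nt[expr]), expr in nullable_nt
        return {expr}, False
    kind, *parts = expr
    if kind == "star":                  # zero occurrences are allowed
        return first1(parts[0], first_nt, nullable_nt)[0], True
    if kind == "alt":                   # union over the alternatives
        symbols, empty = set(), False
        for part in parts:
            f, e = first1(part, first_nt, nullable_nt)
            symbols |= f
            empty = empty or e
        return symbols, empty
    symbols = set()                     # "seq": stop at the first non-empty part
    for part in parts:
        f, e = first1(part, first_nt, nullable_nt)
        symbols |= f
        if not e:
            return symbols, False
    return symbols, True


def lookahead(expr, follow_a, first_nt, nullable_nt=frozenset()):
    """FIRST_1(expr ∘ FOLLOW(A)): add FOLLOW(A) when expr can be empty."""
    symbols, empty = first1(expr, first_nt, nullable_nt)
    return symbols | set(follow_a) if empty else symbols


# Hand-computed sets for a statement-list grammar (assumed inputs).
FIRST = {"STAT": {"if-expr-then", "other"}, "STATS": {"if-expr-then", "other"}}
FOLLOW_STATS = {"end", "else", "fi"}

# Lookahead for the item STATS → STAT # {; STAT}* (here the tail γ is empty).
print(sorted(lookahead(("star", ("seq", ";", "STAT")), FOLLOW_STATS, FIRST)))
```

For the iteration {; STAT}* with an empty tail, the result is the terminal ; together with FOLLOW(STATS), which matches the first two terms of the footnote to rule (2a).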
5. CONSTRUCTION OF THE PARSING TABLES
We shall consider only the SLR(1) case for simplicity. From the set of states the functions f (parsing action function) and g (goto function) can be derived. The domain of the function f is extended to include other actions, which correspond to the recognition of occurrences of alternatives and iterations; these actions, which will be called intra-rule reductions, are the following:
push {β}*: the parser is expecting occurrences of an iteration {β}*; it is then driven into a state such that occurrences of the expected iteration can be parsed
pop β: a (non-empty) occurrence of an iteration {β}* has been completed; the states associated with the occurrence are removed from the stack
replace βi: an occurrence of a selection {β1 | ... | βi | ... | βn} has been completed; the states associated with the occurrence are removed, and replaced by a unique state.
The parsing action function f is derived in this way: for each state Q and each terminal symbol u
(1) f(Q, u) = shift if A → α _ uβ is in Q
(2) f(Q, u) = reduce A → α if A → α # is in Q, and u ∈ FOLLOW(A)
(3) f(Q, u) = push {β}* if A → α # {β}*γ is in Q, and u ∈ FIRST({β}*γ ∘ FOLLOW(A))
(4) f(Q, u) = pop β if A → α{β #}*γ is in Q, and u ∈ FIRST({β}*γ ∘ FOLLOW(A))
(5) f(Q, u) = replace βi if A → α{β1 | ... | βi # | ... | βn}γ is in Q, and u ∈ FIRST(γ ∘ FOLLOW(A))
(6) f(Q, ⊣) = accept if S' → S _ ⊣ is in Q
(7) f(Q, u) = error otherwise.
The goto function determines the next current state: for each state Q and each symbol X ∈ VN ∪ VT
(1) g(Q, X) = GOTO(Q, X) if GOTO(Q, X) is defined
(2) g(Q, X) = error otherwise.
The notion of parsing conflict obviously needs to be extended to take care of the parsing actions push, pop and replace: the reader should recognize that there are no parsing conflicts if and only if the equivalent context-free grammar (derived according to Section 3) is SLR(1). Fig. 2 shows the parsing tables for G1.

State  Parsing action function f               Goto function g
0      begin: shift                            begin: 1
1      if..: shift; other: shift               if..: 4; other: 5; STATS: 2; STAT: 3
2      end: shift                              end: 6
3      end, ;, fi, else: push {;STAT}*         end, ;, fi, else: 7
4      if..: shift; other: shift               if..: 4; other: 5; STATS: 8; STAT: 3
5      end, ;, fi, else: red 4
6      ⊣: accept
7      ;: shift; end, fi, else: red 2          ;: 9
8      fi: repl ε; else: shift                 fi: 13; else: 10
9      if..: shift; other: shift               if..: 4; other: 5; STAT: 11
10     if..: shift; other: shift               if..: 4; other: 5; STATS: 12; STAT: 3
11     end, ;, fi, else: pop ;STAT
12     fi: repl else STATS                     fi: 13
13     fi: shift                               fi: 14
14     end, ;, fi, else: red 3
Fig. 2. Parsing tables for G1 (if.. stands for if-expr-then; entries not listed are error).

State  Parsing action function f               Goto function g
0      begin: shift                            begin: 1
1      if..: shift; other: shift               if..: 4; other: 5; STATS: 2; STAT: 3
2      end: shift                              end: 6
3      end, ;, fi, else: red 3                 TAIL: 7
4      if..: shift; other: shift               if..: 4; other: 5; STATS: 8; STAT: 3
5      end, ;, fi, else: red 6
6      ⊣: accept
7      ;: shift; end, fi, else: red 2          ;: 9
8      fi: red 8; else: shift                  else: 10; ELSEP: 13
9      if..: shift; other: shift               if..: 4; other: 5; STAT: 11
10     if..: shift; other: shift               if..: 4; other: 5; STATS: 12; STAT: 3
11     end, ;, fi, else: red 4
12     fi: red 7
13     fi: shift                               fi: 14
14     end, ;, fi, else: red 5
Fig. 3. Parsing tables for G2 (if.. stands for if-expr-then; entries not listed are error).

There is a strict correspondence between the intra-rule reductions and the reductions performed by a parser for the equivalent context-free grammar: push actions of the extended parser correspond to reductions (of the context-free parser) according to the rules B → ε introduced in part (1) of the transformations in Section 3; pop actions correspond to reductions according to the rules B → Bβ; replace actions correspond to reductions according to the rules B → βi of part (2) of the same transformation rules. This correspondence is further evidenced by comparing the tables for G1 with those for G2, which are shown in Fig. 3.
6. THE PARSING ALGORITHM
Before introducing the parsing algorithm, we define the notion of equivalent length of a regular expression, which will be used in pop and replace actions. The purpose of this measure is to allow stack manipulations to be performed without examining the stack contents, by computing in advance the number of items involved in each reduction or intra-rule reduction. The equivalent length (el) of a regular expression is defined recursively:
(1) the equivalent length of a single terminal or nonterminal symbol is 1
(2) if α is a regular expression, the equivalent length of {α}* is 1
(3) if α1, ..., αn are regular expressions, the equivalent length of {α1 | ... | αn} is 1
(4) if α and β are regular expressions, the equivalent length of αβ is el(α) + el(β).
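The recursive definition translates almost literally into code. The following Python sketch assumes a hypothetical tuple encoding of regular expressions; it also assigns length 0 to the empty string, a case the definition does not list explicitly, on the assumption that an empty alternative occupies no stack item.

```python
# Assumed encoding of a regular expression (not from the paper):
#   "x"                     a terminal or nonterminal symbol
#   ("seq", r1, ..., rn)    concatenation; ("seq",) is the empty string
#   ("star", r)             the iteration {r}*
#   ("alt", r1, ..., rn)    the selection {r1 | ... | rn}


def el(expr):
    """Equivalent length: the number of stack items the expression occupies."""
    if isinstance(expr, str):
        return 1                        # case (1): a single symbol
    kind, *parts = expr
    if kind in ("star", "alt"):
        return 1                        # cases (2) and (3): one collapsed item
    # case (4): a concatenation adds up the lengths of its parts;
    # the empty concatenation therefore has length 0 (an assumption)
    return sum(el(part) for part in parts)


# Right-hand side STAT {; STAT}*: one symbol plus one collapsed iteration.
print(el(("seq", "STAT", ("star", ("seq", ";", "STAT")))))
```

The length of an iteration body, here ; STAT with el equal to 2, is what a pop action removes from the stack, while the whole right-hand side counts the collapsed iteration as a single item.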
From this definition it follows that the equivalent length of the right-hand side of a rule is equal to the length of the right-hand side of the corresponding context-free rule, according to Section 3. The parsing algorithm is described by the following procedure, which uses a number of service routines:
(1) pushstack(Q): pushes state Q onto the parse stack
(2) popstack(n): pops n items off the stack
(3) topstack: returns the name of the state on top of the stack
(4) el(α): returns the equivalent length of the regular expression α
(5) nextsymbol: returns the next input symbol. It is assumed that the last symbol is the right endmarker ⊣.
Qc denotes the current state; Q0 is the initial state; sy is the last symbol read; f and g are, respectively, the parsing action and goto functions. As usual in LR parsers, only state names are stored on the stack [11].

procedure parse;
begin
  {establish initial configuration}
  pushstack(Q0); Qc := Q0; sy := nextsymbol;
  {parsing loop}
  while f(Qc, sy) ≠ accept do
    case f(Qc, sy) of
      shift:         pushstack(g(Qc, sy)); sy := nextsymbol
      reduce A → α:  popstack(el(α)); pushstack(g(topstack, A))
      push {β}*:     pushstack(g(Qc, sy))
      pop β:         popstack(el(β))
      replace βi:    popstack(el(βi)); pushstack(g(topstack, sy))
      error:         {call an error recovery routine or halt}
    esac;
    Qc := topstack
  od
end.
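The procedure can be exercised with a small table-driven driver. The Python sketch below is an illustration only: the dictionaries `ACTION` and `GOTO` are a hand-built encoding, assumed here, of SLR(1) tables for the example grammar G1 with its fifteen states, the endmarker is written `$`, and each reduce, pop and replace entry carries its precomputed equivalent length instead of calling el at run time.

```python
END = "$"                                   # stands for the right endmarker
AFTER_STAT = (";", "end", "fi", "else")     # lookaheads that may follow a STAT
START_STAT = {"if-expr-then": ("shift",), "other": ("shift",)}

# Assumed encoding: ("reduce", lhs, el), ("pop", el), ("replace", el).
ACTION = {
    0: {"begin": ("shift",)},
    1: START_STAT,
    2: {"end": ("shift",)},
    3: {t: ("push",) for t in AFTER_STAT},
    4: START_STAT,
    5: {t: ("reduce", "STAT", 1) for t in AFTER_STAT},
    6: {END: ("accept",)},
    7: {";": ("shift",), "end": ("reduce", "STATS", 2),
        "fi": ("reduce", "STATS", 2), "else": ("reduce", "STATS", 2)},
    8: {"else": ("shift",), "fi": ("replace", 0)},
    9: START_STAT,
    10: START_STAT,
    11: {t: ("pop", 2) for t in AFTER_STAT},
    12: {"fi": ("replace", 2)},
    13: {"fi": ("shift",)},
    14: {t: ("reduce", "STAT", 4) for t in AFTER_STAT},
}
GOTO = {
    0: {"begin": 1},
    1: {"if-expr-then": 4, "other": 5, "STATS": 2, "STAT": 3},
    2: {"end": 6},
    3: {t: 7 for t in AFTER_STAT},
    4: {"if-expr-then": 4, "other": 5, "STATS": 8, "STAT": 3},
    7: {";": 9},
    8: {"else": 10, "fi": 13},
    9: {"if-expr-then": 4, "other": 5, "STAT": 11},
    10: {"if-expr-then": 4, "other": 5, "STATS": 12, "STAT": 3},
    13: {"fi": 14},
}


def parse(tokens):
    """Run the parsing loop; return the list of (stack, action) steps."""
    stack, pos, trace = [0], 0, []
    while True:
        qc, sy = stack[-1], tokens[pos]
        action = ACTION[qc].get(sy, ("error",))
        kind = action[0]
        trace.append((tuple(stack), kind))
        if kind == "accept":
            return trace
        if kind == "shift":                 # push the next state, consume input
            stack.append(GOTO[qc][sy])
            pos += 1
        elif kind == "reduce":              # pop el items, go to on the lhs
            _, lhs, length = action
            del stack[len(stack) - length:]
            stack.append(GOTO[stack[-1]][lhs])
        elif kind == "push":                # enter the iteration, keep input
            stack.append(GOTO[qc][sy])
        elif kind == "pop":                 # drop one occurrence of the body
            del stack[len(stack) - action[1]:]
        elif kind == "replace":             # collapse the chosen alternative
            del stack[len(stack) - action[1]:]
            stack.append(GOTO[stack[-1]][sy])
        else:
            raise SyntaxError(f"unexpected {sy!r} in state {qc}")


SAMPLE = "begin if-expr-then other ; other fi end $".split()
for stack, kind in parse(SAMPLE):
    print(" ".join(map(str, stack)), "|", kind)
```

Slicing with `len(stack) - n` rather than a negative index keeps the replace of an empty alternative (length 0) from clearing the whole stack. Note that no step inspects the stack contents below the top: every removal uses a length known when the tables are built.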
Figure 4 shows the parse of the string begin if-expr-then other ; other fi end ⊣.

Parsing stack         Input text                                   Parsing action
0                     begin if-expr-then other ; other fi end ⊣    shift
0 1                   if-expr-then other ; other fi end ⊣          shift
0 1 4                 other ; other fi end ⊣                       shift
0 1 4 5               ; other fi end ⊣                             reduce 4
0 1 4 3               ; other fi end ⊣                             push {;STAT}*
0 1 4 3 7             ; other fi end ⊣                             shift
0 1 4 3 7 9           other fi end ⊣                               shift
0 1 4 3 7 9 5         fi end ⊣                                     reduce 4
0 1 4 3 7 9 11        fi end ⊣                                     pop ;STAT
0 1 4 3 7             fi end ⊣                                     reduce 2
0 1 4 8               fi end ⊣                                     replace ε
0 1 4 8 13            fi end ⊣                                     shift
0 1 4 8 13 14         end ⊣                                        reduce 3
0 1 3                 end ⊣                                        push {;STAT}*
0 1 3 7               end ⊣                                        reduce 2
0 1 2                 end ⊣                                        shift
0 1 2 6               ⊣                                            accept
Fig. 4. Parse of the string begin if-expr-then other ; other fi end ⊣.
7. PARSING TABLES OPTIMIZATION
Owing to the close relation between normal LR tables and the tables constructed by our method for extended context-free grammars, it is possible to perform on the latter the same kinds of optimization that are suggested for LR tables, such as merging of compatible tables, postponement of error checking, elimination of LR(0) reduce states and elimination of single productions [11, 13] (this last optimization being perhaps unnecessary: single productions are in fact unusual in extended context-free grammars). These improvements will not be shown here, since they are derivable in a straightforward way from those working for LR tables. We will discuss two improvements which are peculiar to this technique, and which prove to be feasible and effective for practical programming languages [14].
The first improvement concerns the elimination of push states, that is, states whose only parsing action different from error is push {β}* for some β. Such a state Q can be eliminated if, for all U ∈ VT such that f(Q, U) = push {β}*, g(Q, U) specifies the same state Q'. State Q brings no additional information with respect to Q'; in fact the only action of Q is to set Q' as the new current state (if f(Q, U) = error, then also f(Q', U) = error). Q and Q' can thus be merged, eliminating the push action, and we must accordingly modify the definition of the equivalent length of an iterative group {β}* as being equal to zero.
The second improvement concerns the elimination of replace states, that is, states whose only action different from error is replace βi, for some βi. Suppose that, for all selections, all alternatives inside a selection are of identical equivalent length; we can then avoid replacing them with a unique item, if we define the equivalent length of that selection as being the equivalent length of one of its alternatives. Such states Q can thus be merged with the states Q' such that g(Q, U) = Q' if f(Q, U) = replace βi.
The effect of these improvements on the following grammar G3
  S → E ⊣
  E → T {{+ | -} T}*
  T → a {{* | /} a}*
is illustrated in Fig. 5(a) and (b).
Fig. 5. (a) Parsing tables for G3. (b) Optimized parsing tables for G3.
8. A COMPARISON WITH RELATED APPROACHES
A class of grammars called Extended LR(k) (ELR(k)) grammars has been defined by Madsen and Kristensen [5], together with a parsing algorithm which derives from Knuth's algorithm [15]. Their approach can be summarized in this way: for each rule A → α, α can be split into δ1, δ2, ..., δn such that α = δ1δ2 ... δn and each δi is a selection, an iteration or simply a sequence of grammar symbols. During the construction of the parsing tables, if a state contains an item of the form A → δ1 ... δi _ δi+1 ... δn, the corresponding table is completed with a description of δi, that is, the sequence, iteration or selection just parsed. When a reduction according to the rule A → α is to be made, an instance of α, say α', is on the stack; α' can be split into δ'1, δ'2, ..., δ'n, where δ'i is an instance of δi. The popping of α' from the stack can be performed by popping first δ'n, then δ'n−1, ... and finally δ'1, according to the information stored in the table that, after each pop action, appears on the top of the stack. Since iterations and selections can be nested, each δi and δ'i may need to be split in a similar way.
The method respects the two fundamental requirements of bottom-up parsing, i.e.
(1) a reduce action replaces (an instance of) the right-hand side of a rule with the corresponding left-hand side
(2) no partial recognition takes place until the whole handle has been pushed onto the stack.
There are however two drawbacks which make it impractical in compilers for programming languages:
(1) The parsing algorithm and the encoding of the parsing tables are complicated by the need to account for the recursive splitting of the handle during a reduction; moreover the stack must be large enough to record all instances of iterations, and the growth of the stack can be very large because iterative rules are so frequent in programming languages. For example, the rule PROGRAM → begin {STATEMENT}* end causes the stack to record one item for each statement parsed.
(2) The recognition of which instance of a production has been parsed is delayed until the reduction is performed: this prevents semantic actions from being associated with selections and iterations, also because the components of the instance of the rule are popped (i.e. recognized) in backward order with respect to the order in which they appeared in the input text.
An investigation of space requirements reveals that our method initially produces larger tables, due to the additional states for push and replace actions; a substantial compaction
can however be made, leading eventually to tables which are smaller than those constructed by Madsen and Kristensen's method (for grammar G3 their algorithm produces a set of 10 states). As noted in Ref. [3], it is not evident how similar optimizations can be made following their approach.
In Ref. [2] LaLonde defined the class of Regular Right Part (RRP) grammars, in which the production right parts are nondeterministic finite state machines (which are equivalent to regular expressions), together with a class of parsers called RRP LR(m, k) parsers, of which context-free LR(k) parsers are a special case. Full details about the parsing algorithm and parser construction can be found in Refs [2, 3]; the parsing method resembles that of Madsen and Kristensen: reductions are performed through readback and shiftback states, which scan the stack contents in order to identify the left end of the handle. Remarks similar to those made above apply to this method too: the need for readback states increases the size of the parser, although in this case state compaction is feasible. Readback operations can also increase the parsing time: for the sentence a + a*a*a generated by grammar G3, LaLonde's parser requires 28 moves, while our parser requires 20 moves, and only 14 moves are needed if optimized tables are used. As discussed in Ref. [2], the form of the grammar influences the number of moves required, so the above result cannot be generalized.
It should be noted, however, that both Madsen and Kristensen's and LaLonde's methods provide a formal background, precisely defining two classes of grammars and their properties, while the approach illustrated here aims to provide a practical technique rather than a theoretical framework. RRP grammars in particular appear to be superior as a formalism for defining syntax, as they allow more control over the generated parser, and do not suffer from a number of ambiguities and redundancies in the right-hand sides of productions.
9. CONCLUSION
In this paper we have illustrated an LR parsing technique for extended context-free grammars which is simple and suitable for inclusion in one-pass compilers, and in all cases where the parse tree of the input text is not required. The implementation effort is not high, since the algorithms involved are extensions of those known for context-free LR grammars, and large parts of them can be retained. The major advantage of LR over LL techniques is that a larger class of languages can be covered, and efficient algorithms are known for producing optimized parsers. Extending LR techniques to work with extended context-free grammars avoids an often heavy and useless translation into context-free form; moreover, the possibility of having a single practical description of the syntax, suitable for both bottom-up and top-down analysis, allows meaningful comparisons between different parsing techniques.
10. SUMMARY
Extended context-free grammars (ECFG), which use regular expressions (or equivalent notations) in the right-hand sides of productions, are often a clearer and more natural way to describe the syntax of programming languages than pure BNF notation. Top-down parsers, mainly recursive descent ones, are quite suitable for dealing with this kind of syntactic description. Bottom-up parsers require major changes with respect to the context-free form; they need to keep track of the occurrences of regular expressions encountered, in order to reduce the handle correctly. Efficient LR parsers can be constructed for ECFGs, which simulate the behaviour of a conventional LR parser on an equivalent context-free grammar. The existing algorithms for parsing and for the construction of parsing tables require only small extensions to handle regular expressions, and optimizations can be performed as in conventional LR parsers. In this paper such a parsing method is described, along with the algorithms for constructing and optimizing the parsing tables.
A comparison with other, more formal proposals that have appeared in the literature reveals that the proposed technique is practical and suitable for inclusion in translators for programming languages.

REFERENCES

1. F. L. DeRemer, Extended LR(k) grammars and their parsers, UCSC, California (1970).
2. W. R. LaLonde, Regular right part grammars and their parsers, CACM 20, 731-741 (1977).
3. W. R. LaLonde, Constructing LR parsers for regular right part grammars, Acta Informat. 11, 177-193 (1979).
4. J. Lewi, K. De Vlaminck, J. Huens and M. Huybrechts, Project LILA. The ELL(1) generator: basic principles, Report CW5, Katholieke Universiteit Leuven (1976).
5. O. L. Madsen and B. B. Kristensen, LR parsing of extended context-free grammars, Acta Informat. 7, 61-73 (1976).
6. D. H. Thompson, The design and implementation of an advanced LALR parse table constructor, Techn. Rep. CSRG-79, University of Toronto (1977).
7. U. Ammann, The method of structured programming applied to the development of a compiler, In International Computing Symposium 73 (Edited by Gunther et al.). North Holland, Amsterdam (1974).
8. D. Gries, Compiler Construction for Digital Computers. Wiley, New York (1971).
9. F. L. DeRemer, Lexical analysis, In Compiler Construction: An Advanced Course (Edited by F. L. Bauer and J. Eickel). Springer Verlag, Berlin (1974).
10. J. Earley, An efficient context-free parsing algorithm, CACM 13, 94-102 (1970).
11. A. V. Aho and J. D. Ullman, The Theory of Parsing, Translation and Compiling, Vols 1 and 2. Prentice-Hall, Englewood Cliffs, NJ (1973).
12. F. L. DeRemer, Simple LR(k) grammars, CACM 14, 453-460 (1971).
13. T. Anderson, J. Eve and J. J. Horning, Efficient LR(1) parsers, Acta Informat. 2, 12-39 (1971).
14. A. Celentano, Parsing languages described by syntax graphs, In International Computing Symposium 77 (Edited by D. Ribbens and E. Mortet), pp. 227-235. North Holland, Amsterdam (1977).
15. D. E. Knuth, On the translation of languages from left to right, Information Control 8, 607-639 (1965).
About the Author: AUGUSTO CELENTANO was born in Milano, Italy, in 1950. He received his doctoral degree in Electrical Engineering from the Politecnico di Milano in 1973. Since 1974 he has worked as a Research Assistant at the Politecnico di Milano. His main research interests include compiler construction, programming languages and software engineering. He is the author of several publications in those fields.