Information Processing Letters 41 (1992) 87-91
North-Holland
14 February 1992
A recursive ascent Earley parser

René Leermakers
Philips Research Laboratories, P.O. Box 80.000, 5600 JA Eindhoven, Netherlands

Communicated by D. Gries
Received 7 August 1991
Revised 9 October 1991

Keywords: Functional programming, language processors
1. Introduction
Recently, the theory of LR-parsing gained new impetus by the discovery of the recursive ascent implementation technique for deterministic [2] and nondeterministic [3,4] LR-parsers. In short, the novelty is that LR-parsers can be implemented purely functionally and that this implementation has very simple correctness proofs. In its primary form, a recursive ascent parser consists of two functions for each state. One of the major application areas of the Earley algorithm [1] is the field of computational linguistics. A few years ago, Tomita [5] proposed a nondeterministic LR-parser to parse natural languages. This parser stirred research to establish the differences between the Earley and Tomita parsers. One of the results of this research was the insight that the Earley algorithm can be seen as a shift-reduce parser and as such fits in a large family of parsers to which nondeterministic LR-parsers also belong. A consequence of this insight is that implementation techniques for LR-parsers can also be applied to the Earley algorithm. In particular, this can be done with the recursive ascent technique, and this is the subject of this note.
2. The algorithm

Consider CF grammar G, with terminals V_T and nonterminals V_N. Let V = V_N ∪ V_T. Recursive ascent parsers consist of a number of functions. Let us start with associating a function to each item. Items, grammar rules with a dot somewhere in the right-hand side, are used ubiquitously in parsing theory to denote a partially recognized rule. A typical item is written as A → α.β, with Greek letters for arbitrary elements of V*. We use the unorthodox embracing operator [·] to map each item to its function:

[A → α.β] : N → 2^N,

where N is the set of integers, or a subset 0...n_max with n_max the maximum sentence length, and 2^N is the powerset of N. The functions are to meet the following specification:

[A → α.β](i) = { j | β →* x_{i+1}...x_j },

with x_1...x_n the string to be parsed. So, the function reports which parts of the string can be derived
0020-0190/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved
from β starting from position i. Below we will find a constructive definition for this function that can be viewed as a functional implementation. If we add a grammar rule S' → S to G, with S' ∉ V, then x_1...x_n ∈ L(G) is equivalent to n ∈ [S' → .S](0), so that the algorithm is to be invoked by calling [S' → .S](0). To be able to construct an implementation of [A → α.β] that has no problems with left-recursive grammars, we need so-called predict sets. Let predict(A → α.β) be the set of initial items that are derived from A → α.β by the closure operation:
predict(A → α.β) = { B → .γ | B → γ ∧ β ⇒* Bδ }.

The double arrow ⇒ denotes a left-most-symbol rewriting with a non-ε grammar rule, i.e.

α ⇒ β ≡ ∃_{Bγδ}( α = Bγ ∧ β = δγ ∧ B → δ ∧ δ ≠ ε ).

A recursive ascent recognizer may be obtained by relating to each item A → α.β not only the above [A → α.β], but also a function that we take to be the result of applying the operator [·] to the item:

[A → α.β] : V × N → 2^N.

It has the specification (X ∈ V)

[A → α.β](X, i) = { j | ∃_γ: β ⇒* Xγ ∧ γ →* x_{i+1}...x_j }.

Assuming x_{n+1} to be some end-of-sentence marker that is not in V, it can never be that β ⇒* x_{n+1}γ, hence [A → α.β](x_{n+1}, n+1) = ∅. For i ≤ n the above functions are recursively implemented by

[A → α.β](i) = [A → α.β](x_{i+1}, i+1)
    ∪ { j | ∃_B: B → .ε ∈ predict(A → α.β) ∧ j ∈ [A → α.β](B, i) }
    ∪ { i | β = ε },

[A → α.β](X, i) = { j | ∃_γ: β = Xγ ∧ j ∈ [A → αX.γ](i) }
    ∪ { j | ∃_{Cδk}: j ∈ [A → α.β](C, k) ∧ C → .Xδ ∈ predict(A → α.β) ∧ k ∈ [C → X.δ](i) }.

The correctness of this implementation is shown by the following proof.
First we notice that

β →* x_{i+1}...x_j ≡ ∃_γ( β ⇒* x_{i+1}γ ∧ γ →* x_{i+2}...x_j )
    ∨ ∃_{Bγ}( β ⇒* Bγ ∧ B → ε ∧ γ →* x_{i+1}...x_j )
    ∨ ( β = ε ∧ i = j ).

Substituting this in the specification of [A → α.β] one gets

[A → α.β](i) = { j | ∃_γ: β ⇒* x_{i+1}γ ∧ γ →* x_{i+2}...x_j }
    ∪ { j | ∃_{Bγ}: β ⇒* Bγ ∧ B → ε ∧ γ →* x_{i+1}...x_j }
    ∪ { j | β = ε ∧ i = j }.
This directly leads to the implementation given above, using the specifications of [A → α.β](x_{i+1}, i+1), [A → α.β](B, i) and predict(A → α.β). For establishing the correctness of [A → α.β](X, i), notice that β ⇒* Xγ either consists of zero steps, in which case β = Xγ, or it contains at least one step:

∃_γ( β ⇒* Xγ ∧ γ →* x_{i+1}...x_j )
    ≡ ∃_γ( β = Xγ ∧ γ →* x_{i+1}...x_j )
    ∨ ∃_{Cδγk}( β ⇒* Cγ ∧ C → Xδ ∧ δ →* x_{i+1}...x_k ∧ γ →* x_{k+1}...x_j ).

Hence, [A → α.β](X, i) may be written as the union of two sets, S_1 and S_2:

S_1 = { j | ∃_γ: β = Xγ ∧ γ →* x_{i+1}...x_j },
S_2 = { j | ∃_{Cδγk}: β ⇒* Cγ ∧ C → Xδ ∧ δ →* x_{i+1}...x_k ∧ γ →* x_{k+1}...x_j }.

With the specification of [A → αX.γ], S_1 may be rewritten as

S_1 = { j | ∃_γ: β = Xγ ∧ j ∈ [A → αX.γ](i) }.

The set S_2 may be rewritten using the specifications of [A → α.β](C, k) and predict(A → α.β):

S_2 = { j | ∃_{Cδk}: j ∈ [A → α.β](C, k) ∧ C → .Xδ ∈ predict(A → α.β) ∧ δ →* x_{i+1}...x_k }.

With the definition of [C → X.δ] one finally gets:

S_2 = { j | ∃_{Cδk}: j ∈ [A → α.β](C, k) ∧ C → .Xδ ∈ predict(A → α.β) ∧ k ∈ [C → X.δ](i) }. □
3. Complexity and variants
The above recognizer needs exponential time resources unless the functions are implemented as memo-functions. Memo-functions memorize for which arguments they have been called. If a function is called with the same arguments as before, the function returns the previous result without recomputing it. In conventional programming languages memo-functions are not available, but they can easily be implemented. The use of memo-functions obsoletes the introduction of devices like parse matrices [1]. The worst-case complexity analysis of the memoized recognizer is quite simple. Let n be the sentence length, |G| the number of items of the grammar, p the maximum number of different left-hand sides in a predict set (bounded by the number of nonterminals), and q the maximum number of items in a predict set with the same symbol after the dot. Then, there are O(|G|pn) different invocations of recognizer functions. Each invocation of a function [I] invokes O(qn) other functions, which all result in a set with O(n) elements. The merge of these sets into one set with no duplicates can be accomplished in O(qn²) time on a random access machine. Hence, the total time complexity is O(|G|pqn³). The space needed for storing function results is O(n) per invocation, i.e. O(|G|pn²) for the whole recognizer. These complexity results are almost identical to the usual ones for Earley parsing. Only the dependence on the grammar variables |G|, p and q slightly differs. Worst-case complexities need not be relevant in
practice. We claim that for many practical grammars the present algorithm is more efficient than existing implementations, for the following reason. The above formulae can be interpreted as defining two functions [·](i) and [·](X, i) that work for variable grammars and strings. This view is convenient when building prototypes. If efficiency is an issue, however, one should precompute as much as possible and actually create, for a fixed grammar, the functions [I](i) and [I](X, i) for every item I. In the terminology of functional programming, the functions [·] are to be evaluated partially for each item. In this way, the grammar is compiled into a collection of functions, just like conventional parser generators compile a grammar into LR-tables or a recursive descent parser. Quite some work that is done at parse time by the standard Earley parser, such as the creation of predict sets and the processing of item sets, is transferred to compile time when transforming the grammar into a functional parser. As a consequence, the compiled parser is more efficient than the standard implementations of the Earley parser.

The above considerations only hold if our algorithm terminates. If the grammar has a cyclic derivation A →⁺ A, the execution of [I](A, i) leads to a call of itself, and the algorithm does not terminate. Also, there may be a cycle of transitions labeled by nonterminals that derive ε. This occurs if for some k one has that for i = 1...k

A_{i+1} → .α_{i+1}β_{i+1} ∈ predict(A_i → α_i.β_i) ∧ α_{i+1} →* ε,

while A_1 = A_{k+1} ∧ α_1 = α_{k+1} ∧ β_1 = β_{k+1}.
Then the execution of [A_1 → α_1.β_1](i) leads to a call of itself, and the algorithm does not terminate. A cycle of this form occurs iff there is a derivation A →⁺ αAβ such that α →* ε. It is easy, however, to define a variant of the recognizer that has no problems with these derivations. It is obtained by dropping the restriction that the left-most-symbol rewriting ⇒ may not use ε-rules. Then this paper's analysis can be repeated. The redefinition of ⇒ affects the function predict and the way β →* x_{i+1}...x_j is to be expressed in terms of ⇒:

β →* x_{i+1}...x_j ≡ ∃_γ( β ⇒* x_{i+1}γ ∧ γ →* x_{i+2}...x_j ) ∨ ( β →* ε ∧ i = j ).

Also the decomposition of β ⇒* Xγ must be rephrased: X is introduced by a grammar rule C → ρXδ or it is not. The result is the following recognizer:

[A → α.β](i) = [A → α.β](x_{i+1}, i+1) ∪ { i | β →* ε },

[A → α.β](X, i) = { j | ∃_{ργ}: β = ρXγ ∧ ρ →* ε ∧ j ∈ [A → αρX.γ](i) }
    ∪ { j | ∃_{Cρδk}: j ∈ [A → α.β](C, k) ∧ C → .ρXδ ∈ predict(A → α.β) ∧ ρ →* ε ∧ k ∈ [C → ρX.δ](i) }.
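The tests ρ →* ε in this variant, as well as the termination condition above, call for two small auxiliary computations: the set of nullable nonterminals, and a check for cyclic derivations A →⁺ A. A possible Python sketch (the helper names nullable and is_cyclic are ours):

```python
def nullable(rules):
    # Least fixpoint: B ->* epsilon iff some rule B -> X1...Xm has every Xi
    # nullable; an empty right-hand side qualifies immediately.
    null, changed = set(), True
    while changed:
        changed = False
        for B, alts in rules.items():
            if B not in null and any(all(X in null for X in rhs) for rhs in alts):
                null.add(B)
                changed = True
    return null

def is_cyclic(rules):
    # A ->+ A holds iff the graph with an edge A -> B for every rule
    # A -> alpha B beta whose alpha and beta are both nullable has a cycle.
    null = nullable(rules)
    edges = {A: {B for rhs in alts for i, B in enumerate(rhs)
                 if B in rules and all(X in null for X in rhs[:i] + rhs[i + 1:])}
             for A, alts in rules.items()}
    visiting, done = set(), set()
    def dfs(v):                                  # depth-first cycle detection
        visiting.add(v)
        for w in edges[v]:
            if w in visiting or (w not in done and dfs(w)):
                return True
        visiting.discard(v)
        done.add(v)
        return False
    return any(A not in done and dfs(A) for A in rules)

print(is_cyclic({"A": [("A", "B"), ("a",)], "B": [()]}))  # -> True
print(is_cyclic({"E": [("E", "+", "E"), ("a",)]}))        # -> False
```

Rejecting cyclic grammars up front and precomputing nullable for the ρ →* ε tests keeps the variant recognizer terminating on all remaining grammars.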
To conclude, the above offers attractive alternatives to the standard Earley parser. Only for cyclic grammars is there a problem with termination. For other grammars one gains efficiency. The above recognition algorithms are not the most efficient ones, however. For instance, there is an additional improvement following from the observation that [A → α.β](i) and [A → α.β](X, i) only depend on β. Therefore, functions for different items may be identical and can be identified. A similar improvement is
applicable to many implementations of the Earley algorithm. A recursive ascent recognizer with this improvement is

[β](i) = [β](x_{i+1}, i+1) ∪ { i | β →* ε },

[β](X, i) = { j | ∃_{ργ}: β = ρXγ ∧ ρ →* ε ∧ j ∈ [γ](i) }
    ∪ { j | ∃_{Cρδk}: j ∈ [β](C, k) ∧ C → .ρXδ ∈ predict(β) ∧ ρ →* ε ∧ k ∈ [δ](i) },

with predict(β) = { B → .ρ | B → ρ ∧ β ⇒* Bγ }. The algorithm is to be invoked by calling [S](0).
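The gain from this identification is easy to quantify for a fixed grammar: one function per distinct dotted suffix β instead of one per item. A small illustration with a hypothetical toy grammar (our own code):

```python
# [A -> alpha.beta](i) and [A -> alpha.beta](X, i) depend only on beta,
# so items sharing a dotted tail share one function.
RULES = {"S": [("A", "B"), ("B", "B")], "A": [("a",)], "B": [("a",)]}

items = [(A, rhs, d) for A, alts in RULES.items() for rhs in alts
         for d in range(len(rhs) + 1)]
tails = {rhs[d:] for (_, rhs, d) in items}

print(len(items), len(tails))   # 10 items collapse to 5 tail functions
```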
References

[1] J. Earley, An efficient context-free parsing algorithm, Comm. ACM 13 (2) (1970) 94-102.
[2] F.E.J. Kruseman Aretz, On a recursive ascent parser, Inform. Process. Lett. 29 (1988) 201-206.
[3] R. Leermakers, Non-deterministic recursive ascent parsing, in: Proc. 5th Conf. of the European Chapter of the Association for Computational Linguistics, 1991.
[4] R. Leermakers, L. Augusteijn and F.E.J. Kruseman Aretz, A functional LR-parser, Theoret. Comput. Sci., accepted for publication.
[5] M. Tomita, Efficient Parsing for Natural Language, A Fast Algorithm for Practical Systems (Kluwer Academic Publishers, Dordrecht, 1986).