A recursive ascent Earley parser


Information Processing Letters 41 (1992) 87-91
North-Holland

14 February 1992

René Leermakers
Philips Research Laboratories, P.O. Box 80.000, 5600 JA Eindhoven, Netherlands

Communicated by D. Gries
Received 7 August 1991
Revised 9 October 1991

Keywords: Functional programming, language processors

1. Introduction

Recently, the theory of LR-parsing gained new impetus by the discovery of the recursive ascent implementation technique for deterministic [2] and nondeterministic [3,4] LR-parsers. In short, the novelty is that LR-parsers can be implemented purely functionally and that this implementation has very simple correctness proofs. In its primary form, a recursive ascent parser consists of two functions for each state. One of the major application areas of the Earley algorithm [1] is the field of computational linguistics. A few years ago, Tomita [5] proposed a nondeterministic LR-parser to parse natural languages. This parser has stirred some research to establish the differences between the Earley and Tomita parsers. One of the results of this research was the insight that the Earley algorithm can be seen as a shift-reduce parser and as such fits in a large family of parsers to which nondeterministic LR-parsers also belong. A consequence of this insight is that implementation techniques for LR-parsers can also be applied to the Earley algorithm. In particular, this can be done with the recursive ascent technique, and that is the subject of this note.

2. The algorithm

Consider CF grammar G, with terminals V_T and nonterminals V_N. Let V = V_N ∪ V_T. Recursive ascent parsers consist of a number of functions. Let us start with associating a function to each item. Items, grammar rules with a dot somewhere in the right-hand side, are used ubiquitously in parsing theory to denote a partially recognized rule. A typical item is written as A → α.β, with Greek letters for arbitrary elements of V*. We use the unorthodox embracing operator [·] to map each item to its function:

    [A → α.β] : N → 2^N,

where N is the set of integers, or a subset 0 ... n_max, with n_max the maximum sentence length, and 2^N is the powerset of N. The functions are to meet the following specification:

    [A → α.β](i) = {j | β →* x_{i+1} ... x_j},

with x_1 ... x_n the string to be parsed. So, the function reports which parts of the string can be derived

0020-0190/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved


from β starting from position i. Below we will find a constructive definition for this function that can be viewed as a functional implementation. If we add a grammar rule S' → S to G, with S' ∉ V, then S →* x_1 ... x_n is equivalent to n ∈ [S' → .S](0), so that the algorithm is to be invoked by calling [S' → .S](0). To be able to construct an implementation of [A → α.β] that has no problems with left-recursive grammars, we need so-called predict sets. Let predict(A → α.β) be the set of initial items that are derived from A → α.β by the closure operation:

    predict(A → α.β) = {B → .δ | B → δ ∧ β ⇒* Bγ}.

The double arrow ⇒ denotes a left-most-symbol rewriting with a non-ε grammar rule, i.e.

    α ⇒ β ≡ ∃_{Bγδ} (α = Bγ ∧ β = δγ ∧ B → δ ∧ δ ≠ ε).

A recursive ascent recognizer may be obtained by relating to each item A → α.β not only the above [A → α.β], but also a function, that we take to be the result of applying the operator [·] to the item as well:

    [A → α.β] : V × N → 2^N.

The two functions associated with an item are distinguished by their arity. The second one has the specification (X ∈ V)

    [A → α.β](X, i) = {j | ∃_γ: β ⇒* Xγ ∧ γ →* x_{i+1} ... x_j}.

Assuming x_{n+1} to be some end-of-sentence marker that is not in V, it can never be that β ⇒* x_{n+1}γ, hence

    [A → α.β](x_{n+1}, n + 1) = ∅.

For i ≤ n the above functions are recursively implemented by

    [A → α.β](i) = [A → α.β](x_{i+1}, i + 1)
                 ∪ {j | ∃_B: B → .ε ∈ predict(A → α.β) ∧ j ∈ [A → α.β](B, i)}
                 ∪ {i | β = ε},

    [A → α.β](X, i) = {j | ∃_γ: β = Xγ ∧ j ∈ [A → αX.γ](i)}
                    ∪ {j | ∃_{Cδk}: j ∈ [A → α.β](C, k) ∧ C → .Xδ ∈ predict(A → α.β) ∧ k ∈ [C → X.δ](i)}.

The correctness of this implementation is shown by the following proof. First we notice that

    β →* x_{i+1} ... x_j ≡ ∃_γ (β ⇒* x_{i+1}γ ∧ γ →* x_{i+2} ... x_j)
                         ∨ ∃_{Bγ} (β ⇒* Bγ ∧ B → ε ∧ γ →* x_{i+1} ... x_j)
                         ∨ (β = ε ∧ i = j).

Substituting this in the specification of [A → α.β] one gets

    [A → α.β](i) = {j | ∃_γ: β ⇒* x_{i+1}γ ∧ γ →* x_{i+2} ... x_j}
                 ∪ {j | ∃_{Bγ}: B → ε ∧ β ⇒* Bγ ∧ γ →* x_{i+1} ... x_j}
                 ∪ {j | β = ε ∧ i = j}.


This directly leads to the implementation given above, using the specifications of [A → α.β](x_{i+1}, i + 1), [A → α.β](B, i) and predict(A → α.β). For establishing the correctness of the two-argument [A → α.β] notice that β ⇒* Xγ either consists of zero steps, in which case β = Xγ, or it contains at least one step:

    ∃_γ (β ⇒* Xγ ∧ γ →* x_{i+1} ... x_j)
        ≡ ∃_γ (β = Xγ ∧ γ →* x_{i+1} ... x_j)
        ∨ ∃_{Cδγk} (β ⇒* Cγ ∧ C → Xδ ∧ δ →* x_{i+1} ... x_k ∧ γ →* x_{k+1} ... x_j).

Hence, [A → α.β](X, i) may be written as the union of two sets, S_1 and S_2:

    S_1 = {j | ∃_γ: β = Xγ ∧ γ →* x_{i+1} ... x_j},
    S_2 = {j | ∃_{Cδγk}: β ⇒* Cγ ∧ C → Xδ ∧ δ →* x_{i+1} ... x_k ∧ γ →* x_{k+1} ... x_j}.

With the specification of [A → αX.γ], S_1 may be rewritten as

    S_1 = {j | ∃_γ: β = Xγ ∧ j ∈ [A → αX.γ](i)}.

The set S_2 may be rewritten using the specifications of [A → α.β](C, k) and predict(A → α.β):

    S_2 = {j | ∃_{Cδk}: j ∈ [A → α.β](C, k) ∧ C → .Xδ ∈ predict(A → α.β) ∧ δ →* x_{i+1} ... x_k}.

With the specification of [C → X.δ] one finally gets

    S_2 = {j | ∃_{Cδk}: j ∈ [A → α.β](C, k) ∧ C → .Xδ ∈ predict(A → α.β) ∧ k ∈ [C → X.δ](i)}.  □
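To make the construction concrete, the recognizer just derived can be sketched in Python. The grammar representation, the example grammar S → S + S | a, and the names predict, item and goto are illustrative choices, not the paper's; memoization (anticipating Section 3) is added so that the sketch runs in polynomial time.

```python
from functools import lru_cache

# Toy grammar (an illustrative choice, not from the paper):
#   S -> S + S | a        -- ambiguous and left-recursive
GRAMMAR = {"S": (("S", "+", "S"), ("a",))}

def predict(beta):
    """predict(A -> alpha.beta): the initial items B -> .delta with
    beta =>* B gamma, where => rewrites the left-most symbol with a
    non-epsilon rule (the closure operation of Section 2)."""
    items, seen, todo = set(), set(), []
    if beta and beta[0] in GRAMMAR:
        todo.append(beta[0])
    while todo:
        b = todo.pop()
        if b in seen:
            continue
        seen.add(b)
        for delta in GRAMMAR[b]:
            items.add((b, delta))              # the item B -> .delta
            if delta and delta[0] in GRAMMAR:  # non-eps rule, nonterminal in front
                todo.append(delta[0])
    return items

def recognize(sentence):
    n = len(sentence)
    x = (None,) + tuple(sentence)              # x[1] .. x[n], 1-based as in the paper

    @lru_cache(maxsize=None)
    def item(A, alpha, beta, i):
        """[A -> alpha.beta](i) = { j | beta ->* x_{i+1} ... x_j }"""
        result = set()
        if i < n:                              # [..](x_{n+1}, n+1) = {} is skipped
            result |= goto(A, alpha, beta, x[i + 1], i + 1)
        for B, delta in predict(beta):         # B -> .eps in the predict set
            if delta == ():
                result |= goto(A, alpha, beta, B, i)
        if beta == ():
            result.add(i)
        return frozenset(result)

    @lru_cache(maxsize=None)
    def goto(A, alpha, beta, X, i):
        """[A -> alpha.beta](X, i) = { j | beta =>* X gamma, gamma ->* x_{i+1} ... x_j }"""
        result = set()
        if beta and beta[0] == X:              # zero steps: beta = X gamma
            result |= item(A, alpha + (X,), beta[1:], i)
        for C, rhs in predict(beta):           # C -> .X delta in the predict set
            if rhs and rhs[0] == X:
                for k in item(C, (X,), rhs[1:], i):
                    result |= goto(A, alpha, beta, C, k)
        return frozenset(result)

    # Augment with S' -> S and call [S' -> .S](0).
    return n in item("S'", (), ("S",), 0)
```

On the ambiguous input a+a+a the top-level call [S' → .S](0) evaluates to the set {1, 3, 5} of prefix lengths derivable from S; recognition asks whether n is among them. Note that left recursion is unproblematic, as claimed, because the predict closure is computed as a finite fixed point rather than by recursive descent.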

3. Complexity and variants

The above recognizer needs exponential time resources unless the functions are implemented as memo-functions. Memo-functions memorize for which arguments they have been called. If a function is called with the same arguments as before, the function returns the previous result without recomputing it. In conventional programming languages memo-functions are not available, but they can easily be implemented. The use of memo-functions obsoletes the introduction of devices like parse matrices [1].

The worst-case complexity analysis of the memoized recognizer is quite simple. Let n be the sentence length, |G| the number of items of the grammar, p the maximum number of different left-hand sides in a predict set (bounded by the number of nonterminals), and q the maximum number of items in a predict set with the same symbol after the dot. Then, there are O(|G|pn) different invocations of recognizer functions. Each invocation of a function [I] invokes O(qn) other functions, that all result in a set with O(n) elements. The merge of these sets into one set with no duplicates can be accomplished in O(qn²) time on a random access machine. Hence, the total time-complexity is O(|G|pqn³). The space needed for storing function results is O(n) per invocation, i.e. O(|G|pn²) for the whole recognizer. These complexity results are almost identical to the usual ones for Earley parsing; only the dependence on the grammar variables |G|, p and q slightly differs. Worst-case complexities need not be relevant in


practice. We claim that for many practical grammars the present algorithm is more efficient than existing implementations, for the following reason. The above formulae can be interpreted as defining two functions, [I](i) and [I](X, i), that work for variable grammars and strings. This view is convenient when building prototypes. If efficiency is an issue, however, one should precompute as much as possible and actually create, for a fixed grammar, the two functions [I] for every item I. In the terminology of functional programming, the functions are to be evaluated partially for each item. In this way, the grammar is compiled into a collection of functions, just like conventional parser generators compile a grammar into LR-tables or a recursive descent parser. Quite some work that is done at parse time by the standard Earley parser, such as the creation of predict sets and the processing of item sets, is transferred to compile time when transforming the grammar into a functional parser. As a consequence, the compiled parser is more efficient than the standard implementations of the Earley parser.

The above considerations only hold if our algorithm terminates. If the grammar has a cyclic derivation A →+ A, the execution of [I](i) leads to a call of itself, and the algorithm does not terminate. Also, there may be a cycle of transitions labeled by nonterminals that derive ε. This occurs if for some k one has, for i = 1 ... k,

    A_{i+1} → .α_{i+1}β_{i+1} ∈ predict(A_i → α_i.β_i) ∧ α_{i+1} →* ε ∧ α_{i+1} ≠ ε,

while A_1 = A_{k+1} ∧ α_1 = α_{k+1} ∧ β_1 = β_{k+1}. Then the execution of [A_1 → α_1.β_1](i) leads to a call of itself, and the algorithm does not terminate. A cycle of this form occurs iff there is a derivation A →+ αAγ such that α →* ε and α ≠ ε. It is easy, however, to define a variant of the recognizer that has no problems with these derivations. It is obtained by dropping the restriction that the left-most-symbol rewriting ⇒ may not use ε-rules. Then this paper's analysis can be repeated. The redefinition of ⇒ affects the function predict and the way β →* x_{i+1} ... x_j is to be expressed in terms of ⇒:

    β →* x_{i+1} ... x_j ≡ ∃_γ (β ⇒* x_{i+1}γ ∧ γ →* x_{i+2} ... x_j) ∨ (β →* ε ∧ i = j).

Also the decomposition of β ⇒* Xγ must be rephrased: X is introduced by a grammar rule C → ρXδ, or it is not. The result is the following recognizer:

    [A → α.β](i) = [A → α.β](x_{i+1}, i + 1) ∪ {i | β →* ε},

    [A → α.β](X, i) = {j | ∃_{ργ}: β = ρXγ ∧ ρ →* ε ∧ j ∈ [A → αρX.γ](i)}
                    ∪ {j | ∃_{Cρδk}: j ∈ [A → α.β](C, k) ∧ C → .ρXδ ∈ predict(A → α.β) ∧ ρ →* ε ∧ k ∈ [C → ρX.δ](i)}.
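The variant relies on the test ρ →* ε. This test can be implemented once and for all by the usual fixed-point computation of nullable nonterminals, sketched below in Python over a toy representation of grammars as a dict from nonterminal to a tuple of right-hand-side tuples (an illustrative choice, not the paper's):

```python
def nullable_nonterminals(grammar):
    """Least fixed point: B is nullable (B ->* eps) iff some rule B -> rho
    has every symbol of rho nullable; the empty rho qualifies vacuously."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for b, rules in grammar.items():
            if b in nullable:
                continue
            if any(all(s in nullable for s in rho) for rho in rules):
                nullable.add(b)
                changed = True
    return nullable

def derives_empty(rho, nullable):
    """rho ->* eps iff every symbol of rho is a nullable nonterminal."""
    return all(s in nullable for s in rho)
```

For the hypothetical grammar G = {"S": (("A", "S", "b"), ("c",)), "A": ((),)}, which is hidden-left-recursive through the nullable A, nullable_nonterminals(G) yields {"A"}.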

To conclude, the above offers attractive alternatives to the standard Earley parser. Only for cyclic grammars is there a problem with termination. For other grammars one gains efficiency. The above recognition algorithms are not the most efficient ones, however. For instance, there is an additional improvement following from the observation that [A → α.β](i) and [A → α.β](X, i) only depend on β. Therefore, functions for different items may be identical and can be identified. A similar improvement is applicable to many implementations of the Earley algorithm. In a recursive ascent recognizer with this improvement the functions are indexed by β alone; the last set above, for instance, becomes

    {j | ∃_{Cρδk}: j ∈ [β](C, k) ∧ C → .ρXδ ∈ predict(β) ∧ ρ →* ε ∧ k ∈ [δ](i)},

with predict(β) = {B → .ρ | B → ρ ∧ β ⇒* Bγ}. The algorithm is to be invoked by calling [S](0).
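The predict sets of this final form can be sketched as follows, again over a toy dict representation of grammars (illustrative, not the paper's) and given a precomputed set of nullable nonterminals (those B with B →* ε, obtainable by a standard fixed-point iteration). Since ⇒ may now use ε-rules, the closure must skip nullable left-most nonterminals:

```python
def predict_tail(beta, grammar, nullable):
    """predict(beta) = { B -> .rho | B -> rho and beta =>* B gamma },
    where => rewrites the left-most symbol and may use eps-rules, so a
    nullable nonterminal in front can be erased."""
    reach, todo = set(), []

    def scan(seq):
        # collect the nonterminals that can become left-most in seq
        for sym in seq:
            if sym not in grammar:     # a terminal blocks the left-most position
                break
            todo.append(sym)
            if sym not in nullable:    # only a nullable nonterminal can be erased
                break

    scan(beta)
    while todo:
        b = todo.pop()
        if b in reach:
            continue
        reach.add(b)
        for rho in grammar[b]:
            scan(rho)
    return {(b, rho) for b in reach for rho in grammar[b]}
```

For the hypothetical grammar G = {"S": (("A", "S", "b"), ("c",)), "A": ((),)} with nullable set {"A"}, predict_tail(("S",), G, {"A"}) contains the item A → .ε as well as both items for S, exactly because the nullable A in front of S can be rewritten away.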

References

[1] J. Earley, An efficient context-free parsing algorithm, Comm. ACM 13 (2) (1970) 94-102.
[2] F.E.J. Kruseman Aretz, On a recursive ascent parser, Inform. Process. Lett. 29 (1988) 201-206.
[3] R. Leermakers, Non-deterministic recursive ascent parsing, in: Proc. 5th Conf. of the European Chapter of the Association for Computational Linguistics, 1991.
[4] R. Leermakers, L. Augusteijn and F.E.J. Kruseman Aretz, A functional LR-parser, Theoret. Comput. Sci., accepted for publication.
[5] M. Tomita, Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems (Kluwer Academic Publishers, Dordrecht, 1986).
