On different LL and LR parsers used in LLLR parsing

On different LL and LR parsers used in LLLR parsing

Accepted Manuscript On different LL and LR parsers used in LLLR parsing Boˇstjan Slivnik PII: DOI: Reference: S1477-8424(16)30185-3 10.1016/j.cl.201...

1MB Sizes 4 Downloads 84 Views

Accepted Manuscript

On different LL and LR parsers used in LLLR parsing Boˇstjan Slivnik PII: DOI: Reference:

S1477-8424(16)30185-3 10.1016/j.cl.2017.06.002 COMLAN 261

To appear in:

Computer Languages, Systems & Structures

Received date: Revised date: Accepted date:

5 December 2016 3 May 2017 9 June 2017

Please cite this article as: Boˇstjan Slivnik, On different LL and LR parsers used in LLLR parsing, Computer Languages, Systems & Structures (2017), doi: 10.1016/j.cl.2017.06.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Highlights • The LLLR parser can be constructed for any LR grammar. • The LLLR parser produces the left parse ot the input string without backtracking.

CR IP T

• The canonical LL or the LALL parser can be used as the backbone.

AC

CE

PT

ED

M

AN US

• Small canoncial or LALR parsers are used for LL conflict resolution.

1

ACCEPTED MANUSCRIPT

On different LL and LR parsers used in LLLR parsing Boˇstjan Slivnik

CR IP T

University of Ljubljana, Faculty of Computer and Information Science Veˇ cna pot 113, SI-1000 Ljubljana, Slovenia

Abstract

As described in [1], LLLR parsing is a method that parses as much of its input string as possible using the backbone SLL(k) parser and uses small embedded canonical left LR(k)

AN US

parsers to resolve LL conflicts. Once the LL conflict is resolved, the embedded parser

produces the left parse of the substring it has just parsed and passes the control back to the backbone parser together with the information about how the backbone parser should realign its stack as a part of the input has been read by the embedded parser. The LLLR parser produces the left parse of the input string without any backtracking and,

M

if used for a syntax-directed translation, it evaluates semantic actions using the same top-down strategy as the canonical LL(k) parser. In this paper, a more general approach towards LLLR parsing is presented as it is described how any kind of canonical LL(k)

ED

or LA(k)LL(k 0 ) parser can be used as the backbone parser and how different kinds of embedded canonical left LR(k) or left LA(k)LR(k 0 ) parsers can be used for LL conflict

PT

resolution.

CE

Keywords: LL parsing, LR languages, left parse

1. Introduction

AC

The two main approaches for parsing programming languages are the top-down and

the bottom-up parsing. LL parsers represent the core of the first group and LR parsers the core of the second. The top-down parsing provides “the readability of recursive descent (RD) implementations of LL parsing along with the ease of semantic action incorporation” while “an LL parser is linear in the size of the grammar”; the bottom-up

Email address: [email protected] (Boˇstjan Slivnik) Preprint submitted to Computer Languages, Systems and Structures

June 21, 2017

ACCEPTED MANUSCRIPT

parsing is regarded highly for “the extended parsing power of LR parsers, in particular the admissibility of left recursive grammar” but “even LR(0) parse tables can be exponential in the size of the grammar” (all quotes are from [2]). The syntax-directed translation represents the pillar of virtually any modern com-

CR IP T

piler’s front-end and thus the implementation of the syntax analysis forms an important part of any compiler. Hence, the parser should provide the compiler writer with (a) a solid framework for evaluation of semantic actions during or after parsing and (b) an appropriate support for producing detailed syntax error reports. An important factor

contributing to the wide use of LL and LR parsers is the straightforward evaluation of semantic routines employed by either group. However, parsers incorporating both

AN US

top-down and bottom-up parsing, e.g., left-corner parsers [3, 4, 5], never gained much popularity because of the confusing order in which the semantic actions are triggered. Much older XLC(1) and LAXLC(1) parsers [4] are extensions of left-corner parsing and (like LLLR parsing) employ both top-down and bottom-up approaches, but they

produce neither the left nor right parse of the input string. In general, during XLC(1)

M

and LAXLC(1) parsing the evaluation of semantic routines might become quite confusing as it changes direction all the time.

ED

Example 1. For instance, if string abbbbccc is parsed using an XLC(1) or an LAXLC(1) parser for grammar with the start symbol S and productions

CE

the output

PT

[S −→ A],[A −→ aBbC],[B −→ Bb],[B −→ b],[C −→ Cc], and [C −→ c],

[S −→ A][A −→ aBbC][B −→ b][B −→ Bb][B −→ Bb][C −→ c][C −→ Cc][C −→ Cc]

AC

is neither left nor (reverse) right parse of the input string [4]. As the first production printed out expands the start symbol at the top of the derivation tree, the evaluation of semantic routines starts at the top. But then the evaluation of semantic routines suddenly changes from the top-down into the bottom-up approach since the second production printed out produces the leaf several levels below the root of the tree. Later the evaluation changes direction from bottom-up to top-down and vice-versa a few more times. 3



ACCEPTED MANUSCRIPT

In recent years the focus has shifted strongly towards the top-down parsing and towards LL parsing in particular [6, 7, 8, 2, 9, 10, 11] — even to the point that yacc was erroneously considered dead [12]. However, GLL parser capable of parsing a language of any context-free grammar might require cubic time. An LL(∗ ) parser using (a) finite

CR IP T

automata to resolve certain LL(k) conflicts and (b) backtracking to resolve the others, has been implemented in antlr [9]. Furthermore, “the LL(∗ ) grammar condition is

statically undecidable and grammar analysis sometimes fails to find regular expressions that distinguish between alternative productions” [13].

On the other hand, LR parsers were modified to produce the left parse of its input and

thus giving the compiler writer the impression of top-down parsing [6, 14, 7]. However,

AN US

these parsers either (a) produce the first production of the left parse (and thus trigger the

first semantic actions) only after the entire input string has been parsed [6] or (b) they further increase the difference of how much space LR or LL parser needs by introducing two additional parsing tables [7]. Much stronger GLR parsing can also be used but the time complexity is O(np + 1) (where p is the length of the longest production). For

M

deterministic grammars, however, both GLL and GLR parsers should run in near-linear time [13].

Finally, ALL(∗ ) parsing performs grammar analysis in parse-time to combine “the

ED

simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions” [13]. It achieves almost linear

PT

time for majority of grammars used in practice where it is consistently faster than GLL and GLR parsing, and can compete with LL or LALR parsing. Nevertheless, its worst case complexity is O(n4 ).

CE

The idea behind LLLR parsing, as introduced in [15] and described in [1], is to

produce the left parse of the input string by using LL parsing as much as possible and

AC

to use small left LR parsers only to avoid LL conflicts. Left LR(k) parsers are LR(k) parsers that produce the left parse instead of the rihgt parse. As described in [1], LLLR parsing uses the canonical LR(k) parsers within an SLL(k) parser in contrast to [16] or LL(∗ ) and ALL(∗ ) parsers use deterministic finite automata “even though lookahead languages (set of all possible remaining input phrases) are often context-free” [13]. In fact, LLLR parsing is strong enough for parsing any language generated by an LR(k) 4

ACCEPTED MANUSCRIPT

grammar, although not every LR(k) grammar is particularly well suited for it [1]. The paper is organized in the way an LLLR parser is constructed. Section 2 includes the presentation of the notation used throughout the paper — an experienced reader might skip it. Section 3 introduces different kinds of conflicts that can appear during

CR IP T

LLLR parsing and the construction of the backbone LL parsers. It somewhat focuses on

the canonical LL(k) parser but avoids any of its particular characteristics so the same

method, with a minor modification, is valid for LA(k)LR(k 0 ) parsers as well. Section 4 describes four different kinds of embedded left LR parsers in a uniformed manner. Each of these LR parsers can be used within the backbone parser described in Section 3.

Section 5, before Conclusion, is devoted to ensuring the LLLR parser does not include

AN US

unneeded parts.

2. Preliminaries

An intermediate knowledge of LL and LR parsing is presumed. The standard terminology of formal language theory and parsing is assumed [17, 18]. For a reader more

M

familiar with the notation introduced in [19, 20] a brief and very compact overview (but not a detailed explanation) is provided in this section.

ED

Apart from the start state and the set of final states, the finite automaton is described by a finite set of transitions i X → i0 (where X is a single input symbol or ε): upon

reading X, the automaton moves from state i to state i0 . Likewise, apart from the initial

PT

stack contents and the set of final states, the pushdown automaton can be described by a finite set of transitions of the form X1 X2 . . . Xn uv → Y1 Y2 . . . Ym v. Each such

CE

transition reads a prefix u of the yet unread part of the input string and replaces n stack symbols X1 , X2 , . . . , Xn on the top of the stack with m stack symbols Y1 , Y2 , . . . , Ym (the topmost symbols are Xn and Ym , before and after the transition). No state is required

AC

if the pushdown automaton is defined in this manner. Let G = hN, T, P, Si be a context-free grammar. The sets N and T , where |N |, |T | <

∞ and N ∩ T = ∅, contain nonterminal and terminal symbols, respectively. The set

P ⊆ N × (N ∪ T )∗ , where |P | < ∞, is a set of productions and S ∈ N denotes the start symbol. The rightmost (leftmost) derivation, denoted by =⇒rm (=⇒lm ), is a derivation where 5

ACCEPTED MANUSCRIPT

the rightmost (leftmost) nonterminal symbol of a sentential form is replaced by the right side of a production expanding it. Hence, ψAw =⇒rm ψαw and wAψ =⇒lm wαψ iff w ∈ T ∗ , ψ ∈ (N ∪ T )∗ and [A −→ α] ∈ P . Let =⇒∗rm and =⇒∗lm denote multiple applications, i.e., transitive closures, of =⇒rm and =⇒lm , respectively.

CR IP T

$-augmented grammar for G is a grammar G0 = hN ∪{S 0 }, T ∪{$}, P ∪[S 0 −→ $S$], S 0 i

where S 0 , $ 6∈ N ∪ T . Let k : x denote string of first k symbols of x (or x itself if |x| ≤ k) ∗ and let T ∗k = {k : x | x ∈ T ∗ }. Let FIRSTG k (α) = {k : x | α =⇒ x}.

2.1. LR parsing

Assume G is $-augmented grammar, i.e., containing S 0 and $..

AN US

String γ ∈ (N ∪ T )∗ is a viable prefix of G and [A → α • β, x] is an LR(k) item

valid for γ iff S 0 =⇒∗rm γ 0 Av =⇒rm γ 0 αβv = γβv and x = k : v. An LR(k) item

i = [A → α • β, x] is active for y ∈ T ∗k , i.e., i ∈ ACTIVE(y), iff y ∈ FIRSTG k (βx). Let TRUNCk0 (q) = {[A → α • β, k 0 : x] | [A → α • β, x] ∈ q}. LR(k 0 )-equivalence of LR(k) items, where 0 ≤ k 0 ≤ k, is defined as i =k0 i0 iff TRUNCk0 ({i}) = TRUNCk0 ({i0 }).

The nondeterministic LR(k) machine for G is a finite semiautomaton whose states

M

are valid LR(k) items of G, with the initial item [S 0 → • $S$, ε] and transitions i ε → i0

0 if i desc i0 (where [A → α • Bβ, x] desc [B → •ω, y] iff y ∈ FIRSTG k (βx)) and i X → i

ED

if i passes-X i0 (where [A → α • Xβ, x] passes-X [A → αX • β, x]).

G G Let VALIDG LR(k) (γ) = CoreLR(k) (γ)∪ClosLR(k) (γ) denote the set of all LR(k) items of

G valid for a viable prefix γ; CoreG LR(k) (γ) contains items derived by passes-X (and the

PT

G initial item if γ = ε), ClosG LR(k) (γ) contains items derived by desc . Thus CoreLR(k) (γ) ∩

ClosG LR(k) (γ) = ∅.

CE

Let relations ρLR(k) and ρLA(k)LR(k0 ) denote LR(k)- and LA(k)LR(k 0 )-equivalence

of viable prefixes, respectively. They are defined as γ ρLR(k) γ 0 iff VALIDG LR(k) (γ) =k G G 0 0 0 VALIDG LR(k) (γ ), and γ ρLA(k)LR(k0 ) γ iff VALIDLR(k) (γ) =k0 VALIDLR(k) (γ ).

AC

The canonical LR(k) machine for G is a finite semiautomaton with a set of states

consisting of equivalence classes of viable prefixes in regard to ρLR(k) (denoted by [γ]LR(k)

for some viable prefix γ), initial state [ε]LR(k) and transitions [γ]LR(k) X → [γX]LR(k)

for all viable prefixes γ and γX of G. The LA(k)LR(k 0 ) machine for G is a finite semiautomaton with a set of states consisting of equivalence classes of viable prefixes in regard to ρLA(k)LR(k0 ) (denoted 6

ACCEPTED MANUSCRIPT

S 0 → • $S$, ε

A → a •, d

A → a •, b

$

S → b • Ad, $

S 0 → $ • S$, ε

b

S → • bAd, $

A → • a, d

S → • dAb, $

A

S

S → bA • d, $

S 0 → $S • $, $

d

$

S → bAd •, $

S 0 → $S$ •, $

a d

S → d • Ab, $ A → • a, b$

CR IP T

a

A

S → dA • b, $ b

S → dAb •, $

AN US

Figure 1: The canonical LR(1) machine for G0 (the initial state [ε] is shown in the middle of the top row). The LA(1)LR(0) machine is obtained by merging states [$ba] (top left) and [$da] (top right).

by [γ]LA(k)LR(k0 ) for some viable prefix γ), initial state [ε]LA(k)LR(k0 ) and transitions [γ]LA(k)LR(k0 ) X → [γX]LA(k)LR(k0 ) for all viable prefixes γ and γX of G.

M

Usually subscripts and superscripts are omitted if understood from the context. Example 2. Consider a $-augmented grammar G0 with productions

ED

[S 0 −→ $S$],[S −→ bAd],[S −→ dAb], and [A −→ a]. The canonical LR(1) machine is shown in Figure 1. If states [$ba] and [$da] are merged, 

PT

the LA(1)LR(0) machine is obtained.

A pushdown automaton with a set of transitions containing shift and reduce ac-

CE

tions is called the canonical LR(k) parser or the LA(k)LR(k 0 ) parser depending on what equivalence classes of viable prefixes are used, i.e., [·]LR(k) or [·]LA(k)LR(k0 ) . Reduce actions are transitions [γ][γX1 ][γX1 X2 ] . . . [γX1 X2 . . . Xn ] y → [γ][γA] y where

AC

[A → X1 X2 . . . Xn • , y] ∈ VALIDG LR(k) (γX1 X2 . . . Xn ). Shift actions are transitions

G [γ] ay → [γ][γa] y iff [A → α • aβ, z] ∈ VALIDG LR(k) (γ) and ay ∈ FIRSTk (aβz). LALR

parsers generated by tools like yacc are actually LA(1)LR(0) parsers (with minor modifications to enhance error recovery). Example 3. The canonical LR(1) parser based on the canonical LR(1) machine shown 7

ACCEPTED MANUSCRIPT

in Figure 1 contains shift actions [$] b →

[$][$b] ε

[$] d →

[$][$d] ε

[$b] a →

[$b][$ba] ε

[$d] a →

[$d][$da] ε

[$dA] b → [$dA][$dAb] ε

CR IP T

[$bA] d → [$bA][$bAd] ε and reduce actions [$][$b][$bA][$bAd] $ →

[$][$S] $

[$][$d][$dA][$dAb] $ →

[$b][$ba] d → [$b][$bA] d

[$][$S] $

[$d][$da] b → [$d][$dA] b

Parsing of the input string bad is therefore carried out as follows: transition

AN US

stack input

shift b : [$] b → [$][$b] ε

$[$] bad$

shift a : [$b] a → [$b][$ba] ε

$[$][$b] ad$ $[$][$b][$ba] d$

reduce [A −→ a] : [$b][$ba] d → [$b][$bA] d

$[$][$b][$bA] d$

shift d : [$bA] d → [$bA][$bAd] ε

$[$][$S] $

reduce [S −→ bAd] : [$][$b][$bA][$bAd] $ → [$][$S] $

M

$[$][$b][$bA][$bAd] $

accept

If states [$ba] and [$da] are merged, LA(1)LR(0) parser is obtained.



ED

A parser is in the LR(k) position [γ] y if state [γ] is on the top of the stack and if y ∈ T ∗k is in the lookahead buffer.

PT

2.2. LL parsing

Assume G is $-augmented grammar, i.e., containing S 0 and $.

CE

String δ ∈ (N ∪ T )∗ is a viable suffix of G and [A → α • β, x] is an LL(k) item valid for

G R δ iff S 0 =⇒∗lm uAδ 0R =⇒lm uαβδ 0R = uαδ R and x ∈ FIRSTG k (δ ). Let VALIDLL(k) (δ)

denote the set of all LL(k) items of G valid for a viable suffix δ.

AC

The canonical LL(k) machine for G is a finite semiautomaton with a set of states con-

sisting of equivalence classes of viable suffixes in regard to LL(k)-equivalence ρLL(k) (denoted by [δ]LL(k) for some viable suffix δ), initial state [ε]LL(k) and transitions [δ]LL(k) X

→ [δX]LL(k) for all viable suffixes δ and δX of G.

The LA(k)LL(k 0 ) machine for G is a finite semiautomaton with a set of states

consisting of equivalence classes of viable suffixes in regard to ρLA(k)LL(k0 ) (denoted 8

ACCEPTED MANUSCRIPT

S 0 → $S$ •, ε

A → • a, ab

A → • a, ad

$

S → dA • b, b$

S 0 → $S • $, $

b

S → bAd •, $

A → a •, b$

S → dAb •, $

A

S

S → d • Ab, ab

S 0 → $ • S$, ba/da

d

$

S → • dAb, da

S 0 → • $S$, $b/$d

a S → bA • d, d$

d

A → a •, d$

CR IP T

a

A

S → b • Ad, ad b

S → • bAd, ba

AN US

Figure 2: The canonical LL(2) machine for G0 (the initial state [ε] is shown in the middle of the top row). The LA(2)LL(1) machine is obtained by merging states [$ba] (top left) and [$da] (top right).

by [δ]LA(k)LL(k0 ) for some viable suffix δ), initial state [ε]LA(k)LL(k0 ) and transitions [δ]LA(k)LL(k0 ) X → [δX]LA(k)LL(k0 ) for all viable suffixes δ and δX of G. Usually subscripts and superscripts are omitted if understood from the context.

M

Example 4. Consider grammar G0 from Example 2 again. The canonical LL(2) machine is shown in Figure 2. If states [$ba] and [$da] are merged, the LA(2)LL(1) machine 

ED

is obtained.

A pushdown automaton with transitions [δ][δA] y → [δXn ][δXn Xn−1 ] . . . [δXn Xn−1

PT

. . . X1 ] y iff [A → • X1 X2 . . . Xn , y] ∈ VALIDG LL(k) (δXn Xn−1 . . . X1 ) and [δa] ay → ε y

0 iff [A → α • aβ, ay] ∈ VALIDG LL(k) (δa) is called the canonical LL(k) or the LA(k)LL(k )

parser depending on what equivalence classes of viable suffixes are used, i.e., [·]LL(k) or

CE

[·]LA(k)LL(k0 ) . A parser is in the LL(k) position [δ][δX] y if states [δ][δX] are the two states on the top of the stack and if y ∈ T ∗k is in the lookahead buffer.

AC

Subscripts and superscripts are omitted whenever understood from the context.

Example 5. The canonical LL(2) parser based on the canonical LL(2) machine shown in Figure 2 contains shift actions [$bAd] da → ε a

[$dAb] ba → ε a

[$b] b$ → ε $

[$d] d$ → ε $

[$ba] ab → ε b

9

[$da] ad → ε d

ACCEPTED MANUSCRIPT

and produce actions [$][$S] da → [$][$b][bA][bAd] da [$b][$bA] ab →

[$b][$ba] ab

[$][$S] ba → [$][$d][dA][dAb] ba [$d][$dA] ad →

[$d][$da] ad

CR IP T

Parsing of the input string bad is therefore carried out as follows: stack input $[$][$S] bad$

transition

produce [S −→ bAd] : [$][$S] ba → [$][$d][dA][dAb] ba shift b : [$dAb] ba → ε a

$[$][$d][$dA][$dAb] bad$ $[$][$d][$dA] ad$

produce [A −→ a] : [$d][$dA] ad → [$d][$da] ad

$[$][$d][$da] ad$

shift a : [$da] ad → ε d

$[$] $

AN US

shift d : [$d] d$ → ε $

$[$][$d] d$

accept

If states [$ba] and [$da] are merged, LA(2)LR(1) parser is obtained.

3.1. Resolving LL conflicts

M

3. The backbone LL parsers



Assume an LL(k) parser for G ∈ LR(k) \ LL(k) is constructed using the $-augmented

ED

grammar G0 = hN, T, P, Si. For any production [A −→ X1 X2 . . . Xn ] ∈ P and viable suffix δ the parser contains an LL(k) produce action

(1)

PT

[δ][δA] y → [δ][δXn ][δXn Xn−1 ] . . . [δXn Xn−1 . . . X1 ] y

iff [A → X1 X2 . . . Xn •, z] ∈ VALID(δ) and y ∈ FIRST(X1 X2 . . . Xn z). As G 6∈ LL(k),

CE

the parser contains LL(k) produce-produce conflict in position [δXn Xn−1 . . . Xj+1 ][δXn Xn−1 . . . Xj ] yj

(2)

AC

iff there exist productions [Xj −→ ω1 ], [Xj −→ ω2 ] ∈ P so that yj ∈ FIRST(ω1 Xj+1 Xj+2 . . . Xn z) ∩ FIRST(ω2 Xj+1 Xj+2 . . . Xn z)

and y ∈ FIRST(X1 X2 . . . Xj−1 yj ). The idea of LLLR parsing is to use an embedded left LR(k) parser for parsing sub-

strings derived from Xj and thus avoid the LL(k) conflict in position (2). But as illustrated with the following example, the story is not so simple. 10

ACCEPTED MANUSCRIPT

Example 6. Take LR(1) grammar with the set of productions {[S −→ Aab], [A −→ a], [A −→ Aa]}. Position [$ba][$baA] a with LL(1) conflicting nonterminal A “on the top of the stack”

CR IP T

exhibits an LL(1) conflict between [A −→ a] and [A −→ Aa] on a. The idea is to parse strings derived from A, namely a+ , using an embedded left LR(1) parser for A. But on

input these strings are not followed by $ but by a = 1: ab$ from production [S −→ Aab]. No LR(1) parser can be made to parse them without pushing the final a from [S −→ Aab] and seeing b in the lookahead buffer first. So instead of a parser for A a parser for Aa is

AN US

needed as this one can always stop on time after seeing b in the lookahead buffer.



To avoid the problem illustrated in Example 6, a more general solution is needed. Namely, for any LL(k) conflicting position (2) appearing in some LL(k) produce action (1) an embedded left LR(k) parser for grammar

where ψj = Xj Xj+1 . . . Xn and

(3)

M

Gψj ,Lj = hN ∪ {S1 , S2 }, T, P ∪ {[S1 −→ S2 z] | z ∈ Lj } ∪ {[S2 −→ ψj ]}, S1 i

ED

Lj = {z | ∃[A → X1 X2 . . . Xn •, z] ∈ VALID(δ) : yj ∈ FIRST(Xj Xj+1 . . . Xn z)} for some S1 , S2 6∈ N , is started. The productions expanding S1 and S2 ensure the

PT

parser can stop instead of performing reduction on [S2 −→ Xj Xj+1 . . . Xn ] and keeps z ∈ Lj in the lookahead buffer provided that Gψj ,Lj ∈ LR(k). It’s trivial to prove that Gψ,L0 ∈ LR(k) and Gψ,L00 ∈ LR(k) iff Gψ,L0 ∪L00 ∈ LR(k).

CE

To start the embedded parser when the conflicting LL(k) position is reached, the

backbone parser uses mapping RIGHT that maps LL(k) conflicting positions to embed-

AC

ded parsers. Namely, RIGHT([δXn Xn−1 . . . Xj−1 ][δXn Xn−1 . . . Xj ] yj ) = hXj Xj+1 . . . Xn , Lj i.

The embedded parser should produce the left parse π(u) in the derivation S=⇒∗lm wAδ R =⇒lm wX1 X2 . . . Xn δ R =⇒∗lm w0 Xj Xj+1 . . . Xn δ R π(u)

π(v)

=⇒lm w0 uδvR δ R =⇒lm w0 uvδ R =⇒∗lm w0 uv w00 . 11

(4)

ACCEPTED MANUSCRIPT

More precisely, the embedded left LR(k) parser for hXj Xj+1 . . . Xn , Lj i should expect a

string uv derived from Xj Xj+1 . . . Xn followed by a string z = k : w00 ∈ Lj on input. On

termination, the parser should print out the left parse π(u) and return the suffix δv of

CR IP T

the viable suffix δδv . The backbone LL(k) parser replaces (n − j + 1) topmost states, i.e., [δXn ][δXn Xn−1 ] . . . [δXn Xn−1 . . . Xj ], with

[δYm ][δYm Ym−1 ] . . . [δYm Ym−1 . . . Y1 ] where δvR = Y1 Y2 . . . Ym and continues parsing.

AN US

Example 7. Consider $-augmented grammar for grammar with productions

[S −→ aDf ],[D −→ aA],[A −→ aBCd],[A −→ aBCe],[B −→ b] and [C −→ c]. Given aaabcdf on input, the backbone LL(1) parser performs the steps aaabcdf $

$[$][$S]

[S −→ aDf ]

M

$[$][$f ][$f D][$f Da] aaabcdf $ aabcdf $

$[$][$f ][$f A][$f Aa]

aabcdf $

ED

$[$][$f ][$f D]

$[$][$f ][$f A]

abcdf $

— LR(1) parser for hA, {f }i — df $

$[$][$f ]

f$

$[$]

$

RIGHT([$f ][$f A] a) = hA, {f }i [A −→ aBCd][B −→ b][C −→ c], d

CE

PT

$[$][$f ][$f d]

[D −→ aA]

After printing productions [S −→ aDf ] and [D −→ aA], the backbone LL(1) parser

reaches the position [$f ][$f A] a and starts the embedded left LR(1) parser for GA,{f }

AC

because A is an LL(1) conflicting nonterminal and thus RIGHT([$f ][$A] a) = hA, {f }i. The embedded parser reads abc as only upon detecting d in the lookahead buffer the production expanding A can be determined (as demonstrated in Example 11 later). By that time, productions expanding B and C have already been determined. To realign the stack of the backbone parser with the input, the backbone parser

removes [$f A] (the embedded parser for A has been used) and replaces it with [$f d] (the 12

ACCEPTED MANUSCRIPT

unread suffix of the string derived from A is derived from the viable suffix d returned by 

the embedded parser). 3.2. Indirect LL conflicts

Example 8. Take an LR(1) grammar with productions

CR IP T

Besides the problem illustrated in Example 6, another problem may arise.

[S −→ Aba],[A −→ B],[B −→ b] and [B −→ Ab].

Position [$ab][$abB] b exhibits an LL(1) conflict and much like in Example 6, the parser

AN US

for B started from [A −→ B] cannot stop with b from [S −→ Aba] in the lookahead

buffer since B derives b+ . Hence, a parser for at least Ab started from [S −→ Aba] is needed even though A is not an LL(1) conflicting nonterminal at all.



As demonstrated in Example 8, an embedded left LR(k) parser for hψj , Lj i might not be always able to terminate on time. In such cases, position [δ][δA] y in produce action

M

(1) must be considered an indirect LL(k) conflict.

To construct the backbone LL parser, the set of positions exhibiting LL(k) conflicts

(0)

LL

ED

is needed. The set of LL(k) produce-produce conflicting positions is computed as = {[δ][δXj ] yj |

(5)

PT

[A → αXj • β, z] ∈ VALID(δ) ∧

[Xj −→ ω1 ], [Xj −→ ω2 ] ∈ P ∧ yj ∈ FIRST(ω1 z) ∩ FIRST(ω2 z)}. (6)

CE

Namely:

1. If condition (5) holds, the LL parser can reach a configuration with the state [δXj ]

AC

on the top of the stack and the state [δ] just beneath it. Furthermore, strings derived from Xj may be followed by z.

2. Hence, the LL parser encounters an LL conflict in position [δ][δXj ] yj iff condition (6) holds because the parser cannot determine which production, [Xj −→ ω1 ] or [Xj −→ ω2 ], should be used in the next produce action.

13

ACCEPTED MANUSCRIPT

To add positions exhibiting indirect LL(k) conflicts sets = {[δ][δA] y | [A0 → αA • β, z] ∈ VALID(δ) ∧ [A −→ X1 X2 . . . Xn ] ∈ P ∧

(7)

yj ∈ FIRST(Xj Xj+1 . . . Xn z) ∧ y ∈ FIRST(X1 X2 . . . Xj−1 yj ) ∧

(8)

[δXn Xn−1 . . . Xj−1 ][δXn Xn−1 . . . Xj ] yj ∈ LL

(9)

CR IP T

(m)

LL

(m−1)

(m−1)

GXj Xj+1 ...Xn ,{z} 6∈ LR(k)} ∪ LL



(10)

must be computed for m > 0. Namely:

1. The first part of condition (7) ensures hat the during LL parsing, the topmost

AN US

state may be [δA] with state [δ] beneath it and that strings derived from A may be followed by z.

2. The second part of condition (7) and the entire condition (8) ensure that configuration [δ][δA] y is indeed reachable. Even more, using production [A −→ X1 X2 . . . Xn ] the parser reaches position [δXn Xn−1 . . . X2 ][δXn Xn−1 . . . X1 ] y and

M

after parsing the string derived from X1 X2 . . . Xj−1 it finally reaches position [δXn Xn−1 . . . Xj−1 ][δXn Xn−1 . . . Xj ] yj . 3. If position [δXn Xn−1 . . . Xj−1 ][δXn Xn−1 . . . Xj ] yj exhibits an LL conflict of any

ED

kind (condition 9) and this conflict cannot be resolved by an embedded LR(k) parser GXj Xj+1 ...Xn ,{z} (condition 10), the position [δ][δA] y exhibits an induced (m)

.

PT

LL(k) conflict and must be included in LL

Finally, the set LL containing positions exhibiting LL(k) conflicts of either kind, i.e.,

CE

produce-produce or indirect, is computed as (m)

LL = LL

(m)

when LL

(m−1)

= LL

AC

for some m > 0.

Example 9. Take an LR(1) grammar with productions [S −→ AbC],[A −→ B],[B −→ b],[B −→ Ab] and [C −→ aA]

14

ACCEPTED MANUSCRIPT

and the corresponding LL(1) parser with produce-produce actions [$][$S] b → [$][$C][$Cb][$CbA] b

[$][$B] b → [$][$b] b

[$][$A] b → [$][$B] b

[$][$B] b → [$b][$bb][$bA] b

[$b][$bA] b → [$b][$bB] b

[$b][$bB] b → [$b][$bb] b [$b][$bB] b → [$b][$bb][$CbbA] b

CR IP T

[$Cb][$CbA] b → [$b][$CbB] b [$][$C] a → [$][$A][$Aa] a

[$Cb][$CbB] b → [$Cb][$Cbb] b

[$Cb][$CbB] b → [$Cb][$Cbb][$CbbA] b

where [$b] = [$bb∗ ] = [$Cbbb∗ ], [$B] = [$bb∗ B] = [$Cbbb∗ B] and [$bA] = [$bb∗ A]. To save space, shift actions are not listed as they are trivial to deduce.

LL

(0)

AN US

The set of positions exhibiting LL(1) produce-produce conflicts is = {[$][$B] b, [$b][$bB] b, [$Cb][$CbB] b}

which is expected as B is the only LL(1) conflicting nonterminal. Then, LL (2)

(2)

as LL

(1)

= LL

(0)

= {[$Cb][$CbA] b} ∪ LL

as expected by analogy from Example 8.



M

and LL = LL

(1)

Note that throughtout this section no particular characteristics of the canonical LL(k)

ED

parser were assumed. Hence, if VALID(δ) is replaced above by VALID0 (δ) = ∪δ0 ∈[δ] VALID(δ 0 )

where [δ] denotes the state of the LA(k)LR(k 0 ) parser, then this parser can be used as

PT

the backbone LL(k) parser.

Finally, the table driven implementation of the backbone LL parser is shown as Algo-

CE

rithm 1. Except for the branch starting on line 10, it is a plain LL(k) parsing algorithm for k ≥ 1. The goto table is the representation of the LL(k) machine transition function, i.e., goto[[δ], X] = [δX] iff there is a transition from [δ] to [δX]. The action table

AC

is the LL parsing table, i.e., a representation of LL shift and produce actions, and thus uses two topmost states. However, it may contain an additional type of entry: the one that triggers the embedded LR parser to resolve the LL(k) conflict. If the LL(k) conflict hXj Xj+1 . . . Xn , Li needs to be resolved, the entire sentential form Xj Xj+1 . . . Xn is removed from the LL stack (the symbol Xj was popped into variable q earlier) as it is later replaced by the remaining viable suffix returned by the embedded LR parser. 15

ACCEPTED MANUSCRIPT

1

push(q0 ); push(goto[q0 , S])

2

while true :

3

q = pop(); q 0 = peek()

4

switch action[q 0q, input] :

5

case accept stop

6

case shift a input++

7

case produce [A −→ X1 X2 . . . Xn ]

CR IP T

Algorithm 1: The backbone LL parser

print([A −→ X1 X2 . . . Xn ])

9

q = peek(); for X = Xn , Xn−1 , . . . , X1 : { q = goto[q, X]; push(q) }

10

AN US

8

case right hψ, Li

11

for i = 2, 3, . . . , |ψ| : { pop() }

12

Y1 Y2 . . . Ym = embedded-lr-parser-for-hψ, Li()

13

q = peek(); for Y = Y1 , Y2 , . . . Ym : { q = goto[q, Y ]; push(q) } default error

M

14

Example 10. Consider the grammar from Example 9. The standard LL(1) action table

ED

is modified to include

as [A → B •, $] ∈ [$],

action[[$b][$bB], b] = right hB, {b}i

as [A → B •, b] ∈ [$b],

action[[$Cb][$CbB], b] = right hB, {b}i

as [A → B •, b] ∈ [$Cb],

PT

action[[$][$B], b] = right hB, {$}i

action[[$Cb][$CbA], b] = right hAbC, {$}i as [S → A • bC, b] ∈ [$Cb]

CE

and [S → AbC •, $] ∈ [$].

Note that the 2nd and 3rd actions specify embedded LR(1) parsers that contain LR(1)

AC

conflicts because they are made for parsing strings b+ derived from B and followed by another b. However, the 1st and 4th action start the embedded LR(1) parsers that parse the entire strings b+ and therefore the 2nd and 3rd action are never used. It is explained later how to remove the 2nd and 3rd action together with the parsers these two actions 

would trigger if used.

16

ACCEPTED MANUSCRIPT

To determine the next action at each step, the LL(k) parser needs to examine the two topmost states on its stack while the SLL(k) parser needs only the topmost, i.e., current, state. Therefore, when implementing the backbone parser it is recommended to

CR IP T

implement it as an SLL parser using the standard LL-to-SLL transformation scheme [18].

4. Embedded left LR parsers

The prefix u of uv in derivation (4) is not determined in advance. Ideally, the embedded left LR parser would parse the shortest prefix u for which the viable suffix δv can be uniquely determined, print out the left parse π(u) and return δv to the backbone parser.

requirement for the shortest prefix.

AN US

To reduce the size of the embedded left LR parser, it is sometimes better to sacrifice the

Different embedded left LR parsers are introduced in this section. The basic difference, influencing the class of grammars these parsers can be built for, is whether they are based on the LR or LALR parser. Among the embedded parsers based on one specific type, e.g., LR or LALR, the difference is about how soon and thus how often during

M

parsing the embedded parser prints a part of the left parse. In other words, to what extent are the parsing and printing out the productions interweaved as compared with

ED

Schmeiser-Barnard parser. By compromising on the most fine-grained interweaving of parsing and printing out productions one can reduce the size of the embedded parser.

PT

4.1. 0-equivalent paths and induced viable suffixes The content of this subsection is from [11] but it is essential for the understanding of

CE

the subsequent subsections. Let NkG be a nondeterministic LR(k) machine for $-augmented grammar G. A γ-path

p in NkG is a sequence of items p = i0 i1 . . . in iff i0 = [S 0 → • $S$, ε] is the initial item of

AC

NkG and γ = X1 . . . Xn where either ij−1 desc ij (and Xj = ε) or ij−1 passes-Xj ij (and

Xj ∈ N ∪ T ) for all j = 1, 2, . . . , n. In other words, p is a path in NkG starting with the

initial item and labelled γ. A path p is active for y iff in ∈ ACTIVE(y).

17

ACCEPTED MANUSCRIPT

Each γ-path induces a single viable suffix. Namely, a γ-path [S 0 → • α1 A1 β1 , x1 ] . . . [S 0 → α1 • A1 β1 , x1 ] (11)

CR IP T

[A1 → • α2 A2 β2 , x2 ] . . . [A1 → α2 • A2 β2 , x2 ] .. . [An−1 → • αn An βn , xn ] . . . [An−1 → αn • An βn , xn ], where γ = α1 α2 . . . αn (and An might be ε), denotes the derivation

S 0 =⇒ α1 A1 β1 =⇒ α1 α2 A2 β2 β1 =⇒ . . . =⇒ α1 α2 . . . αn An βn . . . β2 β1 .

By supplying the right parses (π(vj ))R of substrings vj derived from sentential forms βj ,

AN US

the rightmost derivation

π(v )

π(v )

S 0 =⇒rm α1 A1 β1 =⇒rm 1 α1 A1 v1 =⇒rm α1 α2 A2 β2 v1 =⇒rm 2 α1 α2 A2 v2 v1 =⇒rm . . . π(v )

=⇒rm α1 α2 . . . αn An βn vn−1 . . . v2 v1 =⇒rm n α1 α2 . . . αn An vn . . . v2 v1 , where γ = α1 α2 . . . αn is a viable prefix, is obtained. Likewise, if the left parses π(uj ) of

π(u )

M

substrings uj derived from αj are provided, the leftmost derivation π(u )

S 0 =⇒lm α1 A1 β1 =⇒lm 1 u1 A1 β1 =⇒lm u1 α2 A2 β2 β1 =⇒lm 2 u1 u2 A2 β2 β1 =⇒lm . . . π(u )

ED

=⇒lm u1 u2 . . . un−1 αn An βn . . . β2 β1 =⇒lm n u1 u2 . . . un An βn . . . β2 β1 is obtained where δ = β1R β2R . . . βnR An is a viable suffix induced by path (11). Moreover, δ is the only viable suffix induced by path (11).

PT

Once NkG is transformed into a determinstic machine MkG , all γ-paths start in the initial state [ε] of MkG and end in state [γ], but in general each of these γ-paths induces

CE

its own distinct viable suffix. Hence, the embedded parser can compute the induced viable suffix in position [γ] y iff all γ 0 -paths active for y, for any γ 0 ∈ [γ], induce the same viable suffix.

AC

To know at the time when a parser is constructed whether the parser can compute

induced viable suffix in position [γ] y or not, two relations are used. First, the relation of 0-equivalence among γ-paths, denoted =0 , is defined as i0 i1 . . . in =0 i00 i01 . . . i0n0 ⇐⇒ n = n0 ∧ ∀j ∈ {0, 1, . . . , n} : ij =0 i0j . For an unambiguous G, γ-paths p and p0 induce the same viable suffix iff p =0 p0 . 18

ACCEPTED MANUSCRIPT

Second, the relation (∼γ ) ⊆ VALID(γ) × VALID(γ) is defined as i1 ∼γ i2 ⇐⇒ ∀γ-paths p1 i1 , p2 i2 : p1 i1 =0 p2 i2 . Namely, two items are related iff all γ-paths ending with either of them are 0-equivalent

CR IP T

and thus induce the same viable suffix. During the embedded parser construction, ∼γ should be computed for every viable prefix γ. The computation of ∼γ is split into computing ∼γ for core items of VALID(γ) and for the non-core items separately: (∼γ ) = (∼Core(γ) ) ∪ (∼Clos(γ) )

AN US

The following rules are used:

.

• Computing ∼Core(·) : The only core item of VALID(ε) is in relation with itself as all ε-paths ending with it consist of this item only and are therefore 0-equivalent: [S 0 → • $S$, ε] ∼Core(ε) [S 0 → • $S$, ε].

Each item in Core(γX) is derived from exactly one item in Core(γ). Since relation

M

passes-X preserves 0-equivalence, two items in Core(γX) are in relation ∼γX iff their predecessors are in relation ∼γ :

ED

i1 ∼Core(γX) i2 ⇐⇒ ∃ i01 , i02 ∈ VALID(γ) :

 i01 passes-X i1 ∧ i02 passes-X i2 =⇒ i01 ∼γ i02 .

• Computing ∼Clos(·) : At first, every two 0-equivalent non-core items are assumed to

PT

be in relation ∼γ . Later on, every two non-core items with some descendants not in relation ∼γ , cease to be in relation ∼γ themselves: (0)

CE

i1 ∼Clos(γ) i2 ⇐⇒ i1 , i2 ∈ Clos(γ) ∧ i1 =0 i2 (m+1)

i1 ∼Clos(γ) i2 ⇐⇒ ∀ i01 , i02 ∈ VALID(γ) :

 (m) i01 desc i1 ∧ i02 desc i2 =⇒ i01 ∼Core(γ) i02 ∨ i01 ∼Clos(γ) i02 .

AC

Finally,

(m)

(m+1)

(m+1)

(∼Clos(γ) ) = (∼Clos(γ) ) =⇒ (∼Clos(γ) ) = (∼Clos(γ) ).

Lastly, relation ∼γ is an equivalence relation over [∼γ ] = {i ∈ VALID(γ) | i ∼γ i}.

The equivalence class of item i ∈ [∼γ ] is denoted by [i]∼γ . 19

ACCEPTED MANUSCRIPT

4.2. A model of the embedded left LR(k) parser Before describing different embedded left LR parsers, a model common to all of them is presented. As an embedded left LR parser must print out the left parse of the input

LR(k) parser must be modified in two ways.

CR IP T

it has read and return the viable suffix the rest of its input is derived from, the ordinary

The first modification is based on the technique introduced in the Schmeiser-Barnard

LALR parser [6]: when pushed, state [γX] is augmented by the left parse of the string derived from X. This is achieved by modifying each shift action [γ] ay → [γ][γa] y to

AN US

h[γ], πi ay → h[γ], πih[γa], εi y

and each reduce action [γ][γX1 ][γX1 X2 ] . . . [γX1 X2 . . . Xn ] y → [γ][γA] y to h[γ], πih[γX1 ], πX1 ih[γX1 X2 ], πX2 i . . . h[γX1 X2 . . . Xn ], πXn i y → h[γ], πih[γA], [A −→ X1 X2 . . . Xn ]·πX1 πX2 . . . πXn i y The shift action is trivial but when a reduction on [A −→ X1 X2 . . . Xn ] is performed, the

M

left parses of substrings derived from Xj are found on the stack, concatenated together and preceded with the production itself. (Note, however, that the modified actions are

ED

not “proper” actions as in general there are infinitely many left parses and thus each LR action appears in infinitely many instances all differing in left parses only.) The other modification is the use of relation 'γ defined as

PT

i1 'γ i2 ⇐⇒ ∀γ 0 ∈ [γ] : ∀γ 0 -paths p1 i1 , p2 i2 : p1 i1 =0 p2 i2 .

Two items are related iff for any γ 0 ∈ [γ] all γ 0 -paths are 0-equivalent, and therefore these

CE

γ 0 -paths all induce the same viable suffix. Hence, once the parser reaches state [γ], it can use equivalence classes of 'γ regardless of how it got into state [γ] as ('γ 0 ) = ('γ )

AC

for all γ 0 ∈ [γ]. For the moment let us just assume 'γ exists for any viable prefix γ.

Suppose now that the backbone parser for G = hN, T, P, Si must parse a prefix of a

string derived from ψ and followed by z ∈ L using an embedded left LR parser. Preferably

the embedded parser should parse only the prefix u of w where ψz =⇒∗rm γvz =⇒∗rm uvz = wz 20

for any z ∈ L.

ACCEPTED MANUSCRIPT

The idea is to use the embedded left LR parser that uses Schmeiser-Barnard augmentation of LR states for the $-augmented grammar Gψ,L with the start symbol S 0 and productions of P and

CR IP T

[S 0 −→ $S$], [S1 −→ S2 z] for all z ∈ L, and [S2 −→ ψ] where S 0 , S1 , S2 , $ 6∈ N ∪ T — see grammar (3). For after reading u the embedded parser reaches configuration

h[$], εih[X1 ], πX1 ih[X1 X2 ], πX2 i . . . h[X1 X2 . . . Xn ], πXn i y where γ = X1 X2 . . . Xn and y = k : vz$.

AN US

If all items of ['γ ] = {i ∈ VALID(γ) | i 'γ i} active for y belong to a single

nonempty equivalence class of 'γ , then for any γ 0 ∈ [γ] all γ 0 -paths of Gψ,L active for y are 0-equivalent and thus induce the same viable suffix. This viable suffix can be computed together with the left parse π(u) of prefix u by performing an operation called long reduction [11]:

M

long-reduction(Γ, [[A → α • β, x]]'γ ) = hπ, δβ R i where

(12)

hπ, δi = long-reduction0 (Γ, [[A → α • β, x]]'γ )

ED

long-reduction0 (h[$], εi, [[S2 → • ψ, z]]'ε ) = hε, εi long-reduction0 (Γ·h[γX], πX i, [[B → • ω, x]]'γX ) = hπ·[B −→ ω], δβ R i

(13)

PT

where hπ, δi = long-reduction0 (Γ·h[γX], πX i, [[A → α • Bβ, x0 ]]'γX )

long-reduction0 (Γ·h[γX], πX i, [[A → αX • β, x]]'γX ) = hπ πX , δi

CE

where hπ, δi = long-reduction0 (Γ, [[A → α • Xβ, x]]'γ )

The long reduction takes on input the stack contents and the equivalence class of 'γ .

It traverses the stack from top to bottom using equivalence classes of 'γ to reconstruct

AC

the appropriate γ-path (disregarding the lookahead strings which are not needed). Hence,

the computation of the left parse and the viable suffix is done backwards along the γ-path. Note that the predecessor of each [i]'γ is uniquely determined (if it was not, i 6∈ ['γ ] and [i]'γ would not exist). A production [B −→ ω] is added to the left parse only in case (13) where [B → • ω, x]

is derived from [A → α • Bβ, x0 ]. At the same time, β R is added to the viable suffix as β 21

ACCEPTED MANUSCRIPT

remains unparsed by the embedded parser. Similarly, β R in (12) is added to the viable suffix as [A → α • β, x] is the last item of the γ-path. Whenever [A → αX • β, x] is encountered on the γ-path, πX is added to the left parse as X has been parsed entirely.

CR IP T

Example 11. Consider the $-augmented grammar for GA,{f } with productions

[S1 −→ S2 f ],[S2 −→ A],[A −→ aBCd],[A −→ aBCe],[B −→ b] and [C −→ c]. First the embedded left LR(1) parser performs steps

abcdf . . .

$h[$], εi $h[$], εi h[$a], εi

bcdf . . .

$h[$], εi h[$a], εi h[$ab], εi

AN US

cdf . . .

$h[$], εi h[$a], εi h[$aB], [B −→ b]i

cdf . . .

$h[$], εi h[$a], εi h[$aB], [B −→ b]i h[$aBc], εi df . . . Then it performs the long reduction in position h[$aBc], εi d because for all γ 0 ∈ [$aBc] all γ 0 -paths ending with core item [C → c •, d] ∈ VALID($aBc) are 0-equivalent and this

only one such $aBc-path

M

item is the only item in Core($aBc) active for d. In fact, in this small example, there is

ED

[S 0 → • $S$, ε][S 0 → $ • S1 $, ε][S1 → • S2 f, $][S2 → • A, f ] [A → • aBCd, f ][A → a • BCd, f ][A → aB • Cd, f ][C → • c, d][C → c •, d].

PT

The result of the long reduction is long-reduction(h[$], εih[$a], εih[$aB], [B −→ b]ih[$aBc], εi, [[C → c •, d]]'$aBc ) =

AC

CE

h [A −→ aBCd][B −→ b][C −→ c], di

22

ACCEPTED MANUSCRIPT

because consecutive recursive calls of long-reduction0 yield long-reduction0 (h[$], εih[$a], εih[$aB], [B −→ b]ih[$aBc], εi, [[C → c •, d]]'$aBc ) = h[A −→ aBCd][B −→ b][C −→ c], di h[A −→ aBCD][B −→ b][C −→ c], di

CR IP T

long-reduction0 (h[$], εih[$a], εih[$aB], [B −→ b]i, [[C → • c, d]]'$aB ) =

long-reduction0 (h[$], εih[$a], εih[$aB], [B −→ b]i, [[A → aB • Cd, f ]]'$aB ) = h[A −→ aBCD][B −→ b], di

long-reduction0 (h[$], εih[$a], εi, [[A → a • Bcd, f ]]'$a ) = h[A −→ aBCD], εi long-reduction0 (h[$], εi, [[A → • aBcd, f ]]'$ ) = h[A −→ aBCD], εi

AN US

long-reduction0 (h[$], εi, [[S2 → • A, f ]]'$ ) = hε, εi (This long reduction is used in Example 7.)



Once the long reduction is carried out by reaching state [$], the computed left parse is printed out and the control is transferred back to the backbone LL parser together

M

with the computed viable suffix.

The embedded LR parser for hψ, Li is shown as Algorithm 2. Except for the branch starting on line 4 the algorithm is the same as the Schmeiser-Barnard parser [6]. However,

ED

if the algorithm ever runs into a situation when the prefix of the left parse can be computed (the condition in line 4), the prefix is computed and printed out, and the

PT

remaining viable suffix is returned. Otherwise, the algorithm parses the entire string derived from ψ, prints out its left parse and returns.

CE

4.3. Embedded full left LR(k) parser If the embedded parser that computes the induced viable suffix at absolutely the

first opportunity is sought for regardless of its size, the embedded full left LR(k) parser

AC

described in [11] should be used. It is based on a refinement of LR(k)-equivalence γ ρ γ 0 ⇐⇒ γ ρLR(k) γ 0 ∧ ('γ ) = ('γ 0 )

where ('γ ) = (∼γ ). The first part of the conjunction guarantees the parser parses its input correctly. The second part guarantees that if the parser can compute the induced viable suffix in state [γ]ρ for some lookahead string y, it can do this regardless of how it 23

ACCEPTED MANUSCRIPT

1

push(hq0 , εi)

2

while true :

3

hq, πi = peek()

4

if left[q, input] 6= ⊥ :

5

hπ, δi = long-reduction(stack , left[q, input])

6

print(π); return δ

7

switch action[q, input] : case accept print(π); return ε

9

case shift q 0 push(hq 0 , εi); input++

AN US

8

CR IP T

Algorithm 2: The embedded LR parser

case reduce [A −→ X1 X2 . . . Xn ]

10 11

πA = ε; for i=1,2,. . . ,n : { hqi , πi i = pop(); πA = πi ◦ πA }

12

hq, πi = peek(); push(hgoto[q, A], [A −→ X1 X2 . . . Xn ] ◦ πA i) default error

13

M

got into state [γ]ρ : for any γ 0 ∈ [γ]ρ , all γ 0 -paths are 0-equivalent and thus induce the same viable suffix.

(14)

PT

ED

To trigger the long reduction, the parser uses function LEFT defined as   [i] ∀i0 ∈ Core(γ) ∩ ACTIVE(y) : i0 =0 i 'γ LEFT( [γ] y ) =  ⊥ otherwise.

Namely, if all core items valid for γ and active for y belong to the same equivalence class,

CE

all γ 0 -paths ending with a core items and active for y are 0-equivalent and induce the same viable suffix for all γ 0 ∈ [γ].

AC

Example 12. Consider that the embedded full left LR(1) parser must be constructed based on grammar GA,{c} with productions [S 0 −→ $S1 $],[S1 −→ S2 c],[S2 −→ A],

[A −→ aBa],[A −→ aBb],[A −→ bBaa],[A −→ bBab],[A −→ bBbb],[B −→ a].

The relevant part of the LR(1) machine for the embedded full left LR(1) parser based on GA,{c} is shown in Figure 3. Each item in Figure 3 has at most one predecessor 24

ACCEPTED MANUSCRIPT

$a

$aa

A → a • Ba, c

S 0 → $ • S1 $, ε S1 → • S2 c, $

a

A → a • Bb, c a

B → a •, b

B → • a, a B → • a, b

S2 → • A, c

$ba B → a •, a

A → • aBa, c A → • aBb, c

B → a •, a

$b

b

a

A → b • Baa, c

A → • bBab, c

A → b • Bab, c

A → • bAbb, c

A → b • Bbb, c B → • a, a B → • a, b

$bB B

A → bB • aa, c

$bBa

a

AN US

A → • bBaa, c

B → a •, b

CR IP T

$

A → bB • ab, c A → bB • bb, c

A → bBa • a, c A → bBa • b, c

Figure 3: A subgraph of the LR(1) machine for the embedded full left LR(1) parser based on GA,{c} from Example 12: states [$aa] and [$ba] are distinct because ('$aa ) 6= ('$ba ).

M

except item [B → • a, a] ∈ VALID($b) that has two non-0-equivalent predecessors [A → b • Baa, c] and [A → b • Bab, c]. Hence, i ∼γ i for every item i ∈ VALID(γ) in Figure 3,

ED

where γ ∈ {$, a, aa, b, bb}, except for items [B → • a, a] ∈ VALID($b) and [B → a •, a] ∈ VALID($ba). Even more, to keep the example small, the grammar has been chosen so that [i]∼γ = {i} for all i iff i ∼γ i.

PT

By definition of the embedded full left LR parser, ('γ ) = (∼γ ) for every viable prefix γ of GA,{c} and thus [$aa] and [$ba] denote distinct states because

CE

('$aa ) = {{[B → a •, a]}, {[B → a •, b]}} and ('$ba ) = {{[B → a •, b]}}

(written as sets of equivalence classes over [∼$aa ] and [∼$ba ], respectively). Further-

AC

more, LEFT([$aa] a) = [[B → a •, a]]'$aa , LEFT([$aa] b) = [[B → a •, b]]'$aa and LEFT([$ba] b) = [[B → a •, b]]'$ba while LEFT([$ba] a) = ⊥. In other words, in state [$aa] the long reduction can be started on either a or b while

in state [$ba] it can be started only on b. The states must be kept distinct since the

decision on whether to start the long reduction or not must be based on the topmost state and the contents of the lookahead buffer only. 25

ACCEPTED MANUSCRIPT

If a string starting with baa is parsed, the reduction on [B −→ a] instead of the long reduction on a must be performed in [$ba] and the long reductions on a or b are 

performed later in state [$bBa].

CR IP T

4.4. Embedded partial left LR(k) parser Because relation ρ used by the embedded full left LR(k) parser is a refinement of the LR(k)-equivalence, the number of its states can grow far beyond the number of states

of the canonical LR(k) parser. To keep the number of states the same, slightly different approach should be used.

Instead of replacing LR(k)-equivalence, relation 'γ defined as

AN US

i1 'γ i2 ⇐⇒ ∀γ 0 ∈ [γ] : i1 ∼γ 0 i2 ,

where [γ] remains the state of the canonical LR(k) parser, is used in definition (14) of function LEFT: that the long reduction can safely be made is now guaranteed by 'γ instead of by ρ (which remains the same as in the canonical LR(k) parser). The number of states remains the same, but as ('γ ) ⊆ (∼γ ) the embedded partial

M

left LR(k) parser cannot compute the viable suffix in all cases that the embedded full left LR(k) parser can.

ED

Example 13. Consider that the embedded partial left LR(1) parser must be constructed based on grammar GA,{c} from Example 12. The relevant part of the LR(1) machine for the embedded partial left LR(1) parser

PT

based on GA,{c} is shown in Figure 4. Unlike in Example 12, [$aa] = [$ba] even though

CE

(∼$aa ) 6= (∼$baa ) because

('$aa ) = ('$ba ) = {{[B → a •, b]}}.

Thus, if a string parsed by the embedded parser starts with aab or bab, the long reduction

AC

can be performed in [$aa] = [$ba], but if it starts with aaa of baa, the long reductions must be performed in [$aB] on a or in [$bBa] on a or b, respectively.



It is evident from Example 13 that the full left LR(1) parser can sometimes perform

long reduction earlier than the partial left LR(1) parser — compare Examples 12 and 13. If k > 1, it can also avoid shifting some input symbols on the LR(k) stack before starting the long reduction. 26

ACCEPTED MANUSCRIPT

$a

$b

A → a • Ba, c

S 0 → $ • S1 $, ε S1 → • S2 c, $

A → aB • a, c

B

A → a • Bb, c a

A → aB • b, c

B → • a, a

a

B → • a, b

S2 → • A, c

$aa, $ba B → a •, a

A → • aBa, c A → • aBb, c

$b

b

A → b • Baa, c

A → • bBab, c

A → b • Bab, c

A → • bAbb, c

A → b • Bbb, c B → • a, b

$bB B

A → bB • aa, c

$bBa

a

AN US

A → • bBaa, c

B → • a, a

B → a •, b

a

CR IP T

$

A → bB • ab, c A → bB • bb, c

A → bBa • a, c A → bBa • b, c

Figure 4: A subgraph of the LR(1) machine for the embedded partial left LR(1) parser based on GA,{c} from Example 12: [$aa] and [$ba] denote the same state.

M

4.5. Embedded full left LA(k)LR(k 0 ) parser

The embedded full left LA(k)LR(k 0 ) parser, where 0 ≤ k 0 ≤ k, can be defined similarly

ED

to the embedded full left LR(k) parser by using refinement LA(k)LR(k 0 )-equivalence defined as

γ ρ γ 0 ⇐⇒ γ ρLA(k)LR(k0 ) γ 0 ∧ ('γ ) =k0 ('γ 0 ).

(15)

CE

PT

Likewise, the definition of function LEFT is modified to    [i] ∀i0 ∈ ∪γ 0 ∈[γ] Core(γ 0 ) ∩ ACTIVE(y) : i0 =k0 i 'γ LEFT( [γ] y ) =  ⊥ otherwise.

Example 14. Consider that the embedded full left LR(1) parser must be constructed based on the grammar with productions

AC

[S 0 −→ $S1 $],[S1 −→ S2 c],[S2 −→ A],

[A −→ aEBa],[A −→ aEBb],[A −→ bEBaa],[A −→ bEBab],[A −→ bEBbb],[B −→ a],

and

[E −→ E + T ],[E −→ E − T ],[E −→ T ], [T −→ T + F ],[T −→ T − F ],[T −→ F ], [F −→ num],[F −→ id],[F −→ (E)]. 27

ACCEPTED MANUSCRIPT

The situation is very similar to the one in Example 12. At least the string derived from aEB or bEB must be parsed by the embedded left LR(1) parser but unlike in Example 12 it can be of arbitrary length. However, apart from other states, the full left LR(1) parser needs 33 states for parsing arithmetic expressions derived from E while full

CR IP T

left LA(1)LR(0) parser needs only 18 states and performs the long reduction as soon as



the full left LR(1) parser. 4.6. Embedded partial left LA(k)LR(k 0 ) parser

The embedded partial left LA(k)LR(k 0 ) parser, where 0 ≤ k 0 ≤ k, can be defined similarly to the embedded partial left LR(k) parser. To keep the number of its states the

equivalence and relation 'γ defined as

AN US

same as the number of states of the LA(k)LR(k 0 ) parser, it is based on LA(k)LR(k 0 )-

i1 'γ i2 ⇐⇒ ∀γ 0 ∈ [γ] : ∃i01 , i02 ∈ VALIDk (γ) : i01 =k0 i1 ∧ i02 =k0 i2 =⇒ i01 ∼γ 0 i02 where [γ] remains the state of the LA(k)LR(k 0 ) parser. Function LEFT is defined as in

M

(15) for the embedded full left LA(k)LR(k 0 ) parser.

Example 15. If the embedded partial left LA(1)LR(0) parser for the grammar from

ED

Example 14 is constructed, it starts the long reductions in the same positions as the embedded partial left LR(1) parser but keeps the number of states low as the embedded 

PT

full left LA(1)LR(0) parser.

5. Minimising LLLR parsers

CE

If an LLLR parser is constructed as explained so far, certain items and states will

never be used. For instance, state [$A] (not shown in Figure 3) in Example 12 will be

AC

constructed even though it is not needed. Likewise, item [A → bB • bb, c] ∈ VALID($bB)

and its derivatives will not be used. Hence, after the parser is constructed, all not needed parts must be eliminated. 5.1. Minimising embedded left LR parsers Certain items of an embedded LR(k) parser are never needed because the long reduction terminates LR parsing before the entire input string described by the grammar 28

ACCEPTED MANUSCRIPT

is read and analysed. To remove them, a simple idea is used: if LEFT([γ] y) = [i]'γ , this item i starts the long reduction and hence all items derived from it are not necessary (unless they are derived independently of i as well). The entire process is slightly more complicated as certain items might be needed for

CR IP T

certain contents of the lookahead buffer and not for the others. Hence, a set ACT[γ] ([A →

α • β, x]) is computed for all [A → α • β, x] ∈ VALID(γ). It contains all strings y ∈ FIRST(βx) that the embedded left LR parser can encounter in the lookahead buffer when in state [γ] but does not trigger the left reduction on.

Sets ACT[γ] (i) can be computed simultaenously for all [γ] and i from empty sets by iteratively applying the following rules until no more strings can be added to any set

AN US

ACT[γ] (i):

• Obviously y ∈ ACT[ε] ([S 0 → • $S$, ε]) for all y ∈ FIRST($S$). • [B → • ω, y 0 ] acts on y in state [γ] iff there another item [A → α • Bβ, x] acts on y in [γ] (and thus y does not trigger long reduction in [γ] anyway):

M

y ∈ ACT[γ] ([A → α • Bβ, x]) ∧ [B −→ ω] ∈ P y 0 ∈ FIRST(βx) where y ∈ FIRST(ωy 0 )

ED

y ∈ ACT[γ] ([B → • ω, y 0 ])

In other words, as there is an action on y in [γ], the input might contain something

PT

starting with y and derived by [B −→ ω]. • [A → αa • β, x] acts on y 0 in state [γa] iff y 0 does not trigger long reduction in [γa]

CE

and [A → α • aβ, x] acts on y in [γ] where y 0 denotes the lookahead buffer shifted

AC

one symbol to the right: y ∈ ACT[γ] ([A → α • aβ, x])

y 0 ∈ FIRST(βx) where y ∈ FIRST(ay 0 )

LEFT([γa] y 0 ) = ⊥

y 0 ∈ ACT[γa] ([A → αa • β, x])

• [A → αB • β, x] acts on y 0 in state [γB] iff y 0 does not trigger long reduction in [γB] and the parser performs reduction on [B −→ ω] in state [γω] on y 0 where y 0 29

ACCEPTED MANUSCRIPT

denotes the lookahead buffer shifted one symbol to the right: y ∈ ACT[γ] ([A → α • Bβ, x]) ∧ [B −→ ω] ∈ P y 0 ∈ FIRST(βx) where y ∈ FIRST(ωy 0 )

y 0 ∈ ACT[γB] ([A → αB • β, x])

CR IP T

y 0 ∈ ACT[γω] ([B → ω •, y 0 ]) ∧ LEFT([γB] y 0 ) = ⊥

Hence, an item i ∈ VALID(γ) is needed in [γ] iff it acts on some y in [γ], i.e.,

ACT[γ] (i) 6= ∅, or it actually triggers the long reduction on y in [γ]. An item [A → αX • β, x] actually triggers the long reduction on y in [γX] iff LEFT([γX] y) = [[A →

AN US

αX • β, x]]'γ and

• either X ∈ T and y 0 ∈ ACT[γ] ([A → α • Xβ, x]) where y 0 ∈ FIRST(Xy), • or [X −→ ω] ∈ P and y ∈ ACT[γω] ([X → ω •, y]).

A state [γ] is needed iff there exists an item i needed for [γ]. An LR action is needed iff

M

it is based on the item which is needed.

Example 16. Let us consider the grammar GA,{c} from Example 12 again. Observe,

ED

for instance, that b 6∈ ACT[$ba] ([B → a •, b]) because ACT[$ba] ([B → a •, b]) = ∅ as LEFT([$ba] b) = [[B → a •, b]]'$ba . Hence, b 6∈ ACT[$bB] ([A → bB • bb, c]) and thus [A → bB • bb, c] is not needed in [$bB].



PT

During the long reduction, a viable suffix is computed. This viable suffix is used to realign the stack of the backbone parser with the input after the embedded parser read a

CE

part of the input string. During the realignment, some LL states of the backbone parser will be pushed on the stack. To compute these states, a connection between LR(k) items of the embedded parser and LL states of the backbone parser must be established.

AC

When an embedded LR(k) parser for hXj Xj+1 . . . Xn , Lj i is started in LL position

(2), it starts in state [$] as [S2 → • Xj Xj+1 . . . Xn , z] ∈ VALID($), for all z ∈ Lj . Thus, [S2 → • Xj Xj+1 . . . Xn , z] Bγ [δXn Xn−1 . . . Xj ]

In general case, if [A → α • Bβ, x] Bγ [δ 0 B], substrings derived from B are parsed by

the embedded parser because of [B → • ω, x0 ] ∈ VALID(γ) and would be parsed by the 30

ACCEPTED MANUSCRIPT

backbone parser because of [B → • ω, y] ∈ VALID(δ 0 ω R ) where y ∈ FIRST(ωx0 ). Hence, [A → α • Bβ, x] desc [B → • ω, x0 ] ∧ [A → α • Bβ, x] Bγ [δ 0 B]

.

[B → • ω, x0 ] Bγ [δ 0 ω R ]

CR IP T

Likewise, if substrings derived from X in [A → α • Xβ, x] has already been parsed, then [A → α • Xβ, x] passes-X [A → αX • β, x] ∧ [A → α • Xβ, x] Bγ [δ 0 X] [A → αX • β, x] Bγ [δ ] 0

.

The relations Bγ , for any viable prefix γ, can be computed by these rules.

AN US

Now, if i = [A → α • Xβ, x] is needed in [γ] and LEFT([γ] y) = [i]'γ , then β R X will be a part of the viable suffix computed by the embedded parser as seen in (12). Hence, the embedded LR(k) parser for hXj Xj+1 . . . Xn , Lj i started in LL position (2) induces

position [δ 0 ][δ 0 X] y where i Bγ [δ 0 X] because the backbone parser will reach it after the embedded parser terminates.

Likewise, if y ∈ ACT[γ] ([A → α • BY β, x]) and therefore [A → α • BY β, x] is needed

M

in [γ], and if [[A → α • BY β, x]]'γ and [[B → • ω, y]]'γ where y ∈ FIRST(Y βx), then β R Y will be a part of the viable suffix as seen in (13). Therefore, the embedded LR(k)

ED

parser for hXj Xj+1 . . . Xn , Lj i started in LL position (2) induces positions [δ 0 ][δ 0 Y ] y 0

for all y 0 ∈ ACT[γX] ([A → αB • Y β, x]) where [A → α • BY β, x] Bγ [δ 0 Y X] because the

PT

backbone parser will reach it after the embedded parser terminates. Example 17. Take an embedded LR(1) parser for hAbC, {$}i started in position [$Cb]

CE

[$CbA]|a of the backbone LL(1) parser from Example 9. The relevant part of it is shown in Figure 5 where each item is shown together with its ACT set and its corresponding viable suffix.

AC

All four items with an empty ACT can start the long reduction, but only items based

on productions [B −→ a] and [B −→ Aa] actually start long reduction — the other two

items would start it afterwards, but by then the control has been passed to the backbone parser already. Hence, items [S → A • bC, $] and [A → B •, b] are not needed and are

eliminated together with all items derived from them. As [S → AbC •, $] is eliminated, the item [S 0 → $S2 • $, ε] is eliminated as well. 31

ACCEPTED MANUSCRIPT

$

$B

S 0 → $ • S1 $, ε{a}

A→ B•, b

S1 → • S2 $, $ {a}

$Cb

A→ B•, a{a} $Cb

B

{a}$CbB

B →• a, b

{a}$Cba

B →• Aa, b

{a}$CbaA

A→• B, a

{a}$CbaA

B →• a, a

{a}$Cba

B →• Aa, a

{a}$CbaA

$a

a

B → a•, b

$Cb

B → a•, a{a} $Cb

A

$b S2 → A • bC, $

$Cb

$aa, $ba

a

B → Aa•, b

AN US

A→• B, b

CR IP T

S2 → • AbC, $ {a}$CbA

B → A • a, b {a} $Cba B → A • a, a {a} $Cba

$Cb

B → Aa•, a{a} $Cb

Figure 5: A subgraph of the LR(1) machine for the embedded partial left LR(1) parser for hAbC, {$}i started in position [$Cb][$CbA]|a of the backbone LL(1) parser from Example 9.

This parser induces exactly one LL(1) positions, namely [$C][$Cb] a: nothing is left

M

to be parsed in items [B → a •, b] [B → Aa •, b], the same is true for the item [A → • B, b], but in [S2 → • AbC, $] the string Cb is added to the resulting viable suffix. However,

ED

only the position with b on the top of the stack is induced as only the position for the first symbol of (Cb)R is induced.



PT

5.2. Minimising the backbone LL parser To remove all unneeded actions of the backbone LL parser a set USED of all reachable

CE

positions is computed using the following straightforward rules: • Obviously [$][$S] y ∈ USED for all y ∈ FIRST(S$).

AC

• If [δ][δA] x is reachable but not LL(k) conflicting position, then the state corresponding to the first symbol, i.e., X, on the rightside of a production expanding A is reachable as it will be placed on the top of the stack after the produce action in [δ][δA] x is used: [δ][δA] x ∈ USED \ LL ∧ [A → • Xβ, x] ∈ VALID(δβ R X) [δβ R ][δβ R X] x ∈ USED 32

ACCEPTED MANUSCRIPT

The states corresponding to β will be either declared reachable by the next rule or are not reachable because some embedded parser is used instead. • Likewise, if position corresponding to a symbol in the middle of the production is is also reachable:

CR IP T

reachable but not LL(k) conflicting, the position coresponding to the next symbol

[δβ R Y ][δβ R Y X] x ∈ USED \ LL

[A → α • XY β, x] ∈ VALID(δβ R Y X) ∧ x ∈ FIRST(Xy) [δβ R ][δβ R Y ] y ∈ USED

AN US

• If [A → α•Bβ, y] ∈ VALID(δβ R B) and [δβ R ][δβ R B] y ∈ USED∩LL, the backbone parser reaches LL(k) conflicting position [δβ R ][δβ R B] y. It starts an embedded LR(k) parser for hBβ, Li where

L = {z | ∃[A → αBβ •, z] ∈ VALID(δ) : yFIRST(Bβz)}. Hence, all induced positions of hBβ, Li started in δβ R B are added to USED.

M

Finally, the backbone LL(k) parser contains all LL(k) actions on positions that are reachable, i.e., in USED, but not LL(k) conflicting (as embedded LR(k) parsers will be

ED

started using mapping RIGHT).

Example 18. To minimise the backbone LL parser described in Example 9, [$][$S] a and then [$Cb][$CbA] are added to USED first. Based on [$Cb][$CbA] an embedded

PT

LR(1) parser for hAbC, {$}i and $CbA will be started; it induces [$C][$Cb]|b which is added to USED. Then, [$][$C] and finally [$A][$Aa] are added. The latter one starts

CE

an embedded LR(1) parser for hA, {$}i and $A, but this parser does not induce any position.

AC

Hence, the parser contains LL(1) produce actions [$][$S] a → [$][$C][$Cb]hAbC, {$}i a [$][$C] a → [$]hA, {$}i[$Aa] a

and LL(1) shift actions [$Cb] b → ε ε [$Aa] a → ε ε 33



ACCEPTED MANUSCRIPT

6. A case study: Java To test the new parsing method against the real programming language, Java 1.0 has been chosen. There are two reasons for this: (a) there is an official LALR grammar for

CR IP T

Java 1.0 available (Gosling et. al, The JavaTM Language Specification, 1996), and (b) choosing a language primarily made for parsing in a top-down manner would be unfair.

As the official Java 1.0 grammar is LALR, it contains a lot of left-recursive nonterminals. However, one must distinguish between two kinds of left-recursive nonterminals: 1. Essential left-recursive nonterminals are those where left recursion is used to achieve the proper form of derivation trees.

AN US

The best examples of essential left-recursive nonterminals are those describing arith-

metic expressions as the left recursion is needed to emphasize the left associativity of various arithmetic operators. For instance, productions AdditiveExpression −→ MultiplicativeExpression

AdditiveExpression −→ AdditiveExpression + MultiplicativeExpression

M

AdditiveExpression −→ AdditiveExpression - MultiplicativeExpression describing the structure of additive expressions should not be changed as otherwise

ED

the programmer needs to modify the derivation tree manually. 2. Nonessential left-recursive nonterminals are those where the left recursion is used to make a grammar more suitable for a particular parsing algorithm.

like

PT

For instance, the Java 1.0 grammar includes a number of non-essential left recursion

CE

ClassBodyDeclarations −→ ClassBodyDeclaration ClassBodyDeclarations −→ ClassBodyDeclarations ClassBodyDeclaration

AC

that can easily be rewritten to ClassBodyDeclarations

−→ ClassBodyDeclaration ClassBodyDeclarationsopt

ClassBodyDeclarationsopt −→ ε

ClassBodyDeclarationsopt −→ ClassBodyDeclaration ClassBodyDeclarationsopt using the established method for immediate left recursion elimination. 34

ACCEPTED MANUSCRIPT

Nonterminal symbol ClassBodyDeclarations describes the entire contents of a class — all declarations within a class together. If it is left-recursive (the original productions), virtually entire Java source is parsed by the embedded left LR(1) parser — everything except the package header, import declarations and the class header. If the

CR IP T

non-left-recursive productions are used, then every declaration within a class can be parsed separately using either LL or LR parsing, depending on (a) the structure of a

particular declaration and (b) what other left-recursive productions are replaced by nonleft-recursive ones.

By eliminating the left recursion (using the well-known recipe illustrated above) in

AN US

productions for only ImportDeclarations, TypeDeclarations,

InterfaceMemberDeclarations, ClassBodyDeclarations, and BlockStatements

the percentage of the input that is parsed using the embedded LR(1) parsers drops significantly: (typically) from 98 % to 54 % (but one should be warned that by using the

M

right set of Java sources, especially the second number can be tailored to whatever value one chooses). Instead of parsing the entire contents of an interface or a class using the

ED

embedded left LR(1) parser, individual interface members or statements within a method can be reached using the backbone LL(1) parser — and at the same time, expressions can be parsed using the left-recursive productions producing the most appropriate derivation

PT

trees. Furthermore, approx. 54 % of the input is not parsed using one embedded LR(1) parser as a single substring but rather as a number of short substrings parsed by several

CE

instances of embedded LR(1) parsers. Reducing the length of the substrings that need to be parsed using the individual

embedded LR(k) parser is crucial for efficient LLLR parsing. In case of Java 1.0, these

AC

substrings encompass mostly expressions (which are seldom longer than a few dozen symbols) and prefixes of the field or method declarations. As the left parse of the Java program is produced during LLLR(1) parsing, semantic

routines can be inserted at every position within a grammar, i.e., on both sides of every symbol found on the right side of some production. This is also true for several other parsing methods like SLL, LL(∗ ) and ALL(∗ ), but not for left-corner parsing and its 35

ACCEPTED MANUSCRIPT

derivatives like XLC(1) and LAXLC(1) parsing. For instance, methods like the one described in [21] can handle semantic routines in only 41.95 % of all positions within a grammar. Finally, if Java is considered, different kinds of embedded left LR(1) parsers do not

CR IP T

perform significantly different in regards to the length of the substrings parsed in the

bottom-up manner. In fact, all four parsers yield almost the same results. The difference,

however, shows in the size of the grammar: where the difference between full and partial parser is minimal, the difference between the canonical LR(1) and the LA(1)LR(0) is the same as between ordinary, i.e., not left, canonical LR(1) and LA(1)LR(0) parsers.

AN US

7. Conclusion

As already mentioned in [1], the main advantage of LLLR parsing (which can be used to parse any LR(k) language) over left-corner parsing is that it produces the left parse of the input string while left-corner parsing produces the mixed-strategy parse. If compared with LL(∗ ) and ALL(∗ ) parsing, LLLR parser runs in linear time and does not involve

M

backtracking.

More precisely, an LLLR parser performs exactly the same number of actions as an

ED

LR or LL parser (provided that in the case of a conflict the later could guess/predict each production correctly): it shifts each terminal symbol exactly once and it either produces or reduces each production of the input’s parse exactly once. The left parse construction

PT

during embedded LR parsing the LLLR parser is as fast as in the Schmeiser-Barnard parser but unlike that parser the LLLR parser needs to compute only subparses of the

CE

entire left parse. During a transfer of control to and from an embedded LR parser, some states of the viable suffix might be popped when the embedded LR parser for hψ, Li is triggered and pushed back when it returns — but sentential forms ψ are always suffixes

AC

of the right sides of productions and therefore short compared with the input. However, it must be admitted once again that there is no silver bullet in parsing.

The LLLR parsing works well for certain LR(k) grammars, i.e., typically for grammars

where LL(k) conflicting nonterminals appear close to the bottom of the derivation trees. Otherwise, if LL(k) conflicting nonterminals are found near the top of large subtrees of the derivation tree, LLLR parsing is not to be advised — one should either modify the 36

ACCEPTED MANUSCRIPT

grammar or use some other parsing method (the situation is similar to LR(k) parsing where, if an LR(k) grammar is right-recursive, the parser works but consumes a lot of stack unnecessarily). The basic idea of LLLR parsing used in this paper is the same as in [1]. However, in

CR IP T

this paper it is shown how the canonical LL(k) parsers and LA(k)LL(k 0 ) parsers can be

used for the backbone parser instead of simpler SLL(k) parsers, and how different embed-

ded left LR(k) parsers based on either the canonical LR parser or on the LA(k)LR(k 0 ) parser can be used to resolve the LL(k) conflicts.

AN US

References

[1] B. Slivnik, LLLR parsing: a combination of LL and LR parsing, in: Proceedings of the 5th Symposium on Languages, Applications and Technologies (SLATE’16), Dagstuhl Publishing, Maribor, Slovenia, 2016, pp. 5:1–5:13.

[2] E. Scott, A. Johnstone, GLL parsing, Electronic Notes in Theoretical Computer Science 253 (7) (2010) 177–189.

[3] A. J. Demers, Generalized left corner parsing, in: Proceedings of the 4th ACM SIGACT-SIGPLAN

M

symposium on Principles of programming languages POPL’77, ACM, ACM, Los Angeles, CA, USA, 1977, pp. 170–182.

[4] R. N. Horspool, Recursive ascent-descent parsing, Journal of Computer Languages 18 (1) (1993)

ED

1–16.

[5] M.-J. Nederhof, Generalized left corner parsing, in: Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics EACL’93, Association for Computational Linguistics, Stroudsburg, PA, USA, 1993, pp. 305–314.

PT

[6] J. P. Schmeiser, D. T. Barnard, Producing a top-down parse order with bottom-up parsing, Information Processing Letters 54 (6) (1995) 323–326. [7] B. Slivnik, B. Vilfan, Producing the left parse during bottom-up parsing, Information Processing

CE

Letters 96 (6) (2005) 220–224.

[8] E. Scott, A. Johnstone, R. Economopoulos, BRNGLR: a cubic Tomita-style GLR parsing algorithm, Acta Informatica 44 (6) (2007) 427–461.

AC

[9] T. Parr, K. Fischer, LL(*): The foundation of the ANTLR parser generator, ACM SIGPLAN Notices - PLDI’10 46 (6) (2011) 425–436.

[10] B. Slivnik, The embedded left LR parser, in: Proceedings of the Federated Conference on Computer Science and Information Systems, IEEE Computer Society Press, Szczecin, Poland, 2011, pp. 871– 878. [11] B. Slivnik, LL conflict resolution using the embedded left LR parser, Computer Science and Information Systems 9 (3) (2012) 1105–1124.

37

ACCEPTED MANUSCRIPT

[12] M. Might, D. Darais, Yacc is dead, Available online at Cornell University Library archive (arXiv.org:1010.5023) (2010). [13] T. Parr, S. Harwell, K. Fischer, Adaptive LL(*) parsing: The power of dynamic analysis, in: Proceedings of the 2014 ACM SIGPLAN International Conference on Object-Oriented Programming

USA, 2014.

CR IP T

Systems Languages and Applications (OOPSLA’14), Vol. 579–598, ACM, ACM, Portland, OR,

[14] J. Janouˇsek, B. Melichar, The output-store formal translator directed by LR parsing, in: SOF-

SEM’97: Theory and Practice of Informatics, Vol. 1338 of Lecture Notes in Computer Science, Milovy, Czech Republic, 1997, pp. 432–439.

[15] B. Slivnik, LLLR parsing, in: Proceedings of the 28th Annual ACM Symposium on Applied Computing SAC’13, ACM, Coimbra, Portugal, 2013, pp. 1698–1699.

[16] G. Jaberipur, M. Dorrigiv, Formal ambiguity-resolving syntax definition with asserted shift reduce

AN US

sets, Scientia Iranica D 20 (6) (2013) 1939–1952.

[17] S. Sippu, E. Soisalon-Soininen, Parsing Theory, Volume I: Languages and Parsing, Vol. 15 of EATCS Monographs on Theoretical Computer Science, Springer-Verlag, Berlin, Germany, 1988. [18] S. Sippu, E. Soisalon-Soininen, Parsing Theory, Volume II: LR(k) and LL(k) Parsing, Vol. 20 of EATCS Monographs on Theoretical Computer Science, Springer-Verlag, Berlin, Germany, 1990. [19] A. V. Aho, J. D. Ullman, The Theory of Parsing, Translation and Compiling, Vol. Volume I: Parsing, Prentice-Hall, Englewood Cliffs, N.J., USA, 1972.

M

[20] A. V. Aho, J. D. Ullman, The Theory of Parsing, Translation and Compiling, Vol. Volume II: Compiling, Prentice-Hall, Englewood Cliffs, N.J., USA, 1973. [21] P. Purdom, C. A. Brown, Semantic routines and LR(k) parsers, Acta Informatica 14 (4) (1980)

AC

CE

PT

ED

299–315.

38