Microcomputers in linguistic data processing: context-free parsing


North-Holland Publishing Company. Microprocessing and Microprogramming 9 (1982) 281-284

Ursula Klenk
Seminar für Romanische Philologie der Universität Göttingen, Nikolausberger Weg 23, Göttingen, F.R. Germany

Progress in the development of microcomputers makes it possible to use them for the processing of natural languages, in particular for syntactic analysis and semantics. Microcomputers prove to be a reasonable medium for teaching linguistic data processing in university courses. This paper reports on demonstration programs for context-free parsing which were written for a microcomputer. The programming languages are BASIC and PASCAL. An illustrative example is given.

Keywords: Natural language processing, syntactic analysis, context-free parsing, microcomputers, teaching, demonstration programs.

1. Introduction

In the university study of natural languages, the field of linguistic data processing is becoming more and more important. In dealing with automatic translation or with artificial-intelligence systems that process natural languages, one always has to solve problems of syntactic analysis. Many parsing algorithms have been proposed in the study of context-free grammars. Because context-free grammars play an important part in the syntactic description of natural languages, the teacher of linguistic data processing has the task of showing students how these algorithms work and how they can be transformed into computer programs. For our linguistics classes at the University of Göttingen, J. Mau and the author of this paper programmed several known parsing algorithms in such a manner that they run on a microcomputer, so that they can be used as demonstration and teaching material. First of all, it seemed important to show the principles of parsing (top-down, bottom-up, tabular methods). We therefore wrote programs which do this, but most of which contain no further procedures to increase the efficiency of parsing, although such procedures could be inserted. Besides being demonstration material, the parsers serve as a basis for further investigations, as outlined in section 3.

2. The Parsers

Four programs run on a TRS-80 Model I (48 kbyte CPU, 3 mini-floppy-disks). The input is always a context-free grammar and a string (the input string) which is to be analysed. The output states whether or not the string is generated by the grammar. A short survey follows (the reader is assumed to be familiar with context-free parsing):

1) Method: general top-down-parser (without backtracking) working from left to right which, in executing an expansion, checks all alternatives for a given left side of a rule.
Output: if the input string is generated by the grammar, all possible parse-trees for the string are given.
Programming language: U.C.S.D.-PASCAL.
Remarks: The problem of left-recursion is solved by a special device. The grammar must be ε-free and cycle-free.

2) Method: top-down-parser with backtracking [1].
Output: if the input string is generated, its first left-derivation found.
Programming language: a BASIC dialect.
Remarks: There is a procedure which shortens the parsing process by a sort of lookahead at the symbol of the input string which is to be found in the active stage of the analysis [3]. The problem of left-recursion is solved by a special device. The grammar must be ε-free and cycle-free.


3) Method: general bottom-up-parser executing all possible reductions [2].
Output: all reduction results. If the input string is generated by the grammar, the reductions lead to one or more strings consisting of only one occurrence of the starting symbol of the grammar.
Programming language: U.C.S.D.-PASCAL.
Remarks: The grammar must be ε-free and cycle-free. The algorithm is very slow and serves as a good example of a "bad" algorithm in general. But it may be applied advantageously to special grammars which only have branching rules (rules with more than one symbol on the right side).

4) Method: Earley-Algorithm [1].
Output: the parse-tables for the input string, if it is generated by the grammar.
Programming language: U.C.S.D.-PASCAL.
Remarks: The parser works relatively fast.
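The idea behind method 3) can be illustrated by a short Python sketch that performs all possible reductions on the input string and collects every resulting string. It is only a sketch under stated assumptions, not the U.C.S.D.-PASCAL program described above: the names `all_reductions` and `accepts`, the representation of rules as `(left side, right side)` tuples, and the toy grammar S → a S b | a b are invented for this example.

```python
def all_reductions(rules, tokens):
    """Return every string reachable from `tokens` by repeatedly replacing
    an occurrence of a rule's right side with the rule's left side.
    Assumes an epsilon-free grammar (no empty right sides), so strings
    never grow and the search terminates."""
    start_form = tuple(tokens)
    seen = {start_form}
    todo = [start_form]
    while todo:
        form = todo.pop()
        for lhs, rhs in rules:
            k = len(rhs)
            for p in range(len(form) - k + 1):
                if form[p:p + k] == rhs:
                    reduced = form[:p] + (lhs,) + form[p + k:]
                    if reduced not in seen:
                        seen.add(reduced)
                        todo.append(reduced)
    return seen


def accepts(rules, start, tokens):
    """The string is generated if some reduction result is the lone start symbol."""
    return (start,) in all_reductions(rules, tokens)


# Hypothetical toy grammar: S -> a S b | a b
RULES = [("S", ("a", "S", "b")), ("S", ("a", "b"))]

print(accepts(RULES, "S", "a a b b".split()))
print(accepts(RULES, "S", "a b b".split()))
```

Because every reduction is tried at every position, the number of intermediate strings can grow quickly with the length of the input, which is one way to see why this method is slow on general grammars.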

Two further programs run on a POLYMORPHIC 88 (16 kbyte CPU). The first is a general top-down-parser as in 1) which shows the results of the single parsing steps, but which does not construct the parse-trees. The second simulates a finite nondeterministic automaton. The programming language is a BASIC dialect. For the sake of simplicity of the programs, the terminal and nonterminal symbols of the grammar and the input symbols and state symbols of the automaton have to be encoded in a special manner.

3. Applications

The programs mentioned can be used to study the analysing properties of context-free grammars used in the field of natural language processing. We use some of the parsers as a basis for further linguistic analysis concerning transformational grammars and procedural semantics. In this respect it is important that the programs written in PASCAL are portable to other machines

TOPDOWN-PARSER (breadth-first, working from left to right)

Input:
1. An ε-free and cycle-free context-free grammar G = (VN, VT, R, S), where VN is the set of nonterminal symbols, VT is the set of terminal symbols, R is the set of rules and S is the starting symbol.
2. An input string ST = a1...an, where ai ∈ VT (1 ≤ i ≤ n).

Output: Information whether ST is accepted by G or not. If it is accepted, all left-derivations of ST (in the form of rule-indices) and all possible parse-trees for ST.

Parts of the program system:
1. Input and preparation of the grammar
2. PARSER
3. Procedures which output the history of parsing and the parse-trees

GRAMMAR-PREPARATION
Input of the grammar G (rule-format: A → α)
Gramtest1: checks 1) whether G is ε-free, 2) whether [...]
T-list-Construction ('terminal symbol list'): the T-list for each rule r ∈ R contains all the terminal symbols which can be generated as leftmost symbols by derivations from the right hand side of r.
Gramtest2: 1) If G is left-recursive, it indicates the rules involved (left-recursiveness can be handled by the parser, but may result in long parsing-processes). 2) If there are cycles (i.e. derivations of the form A → ... → A), Gramtest2 gives a warning.
Output of the grammar and the T-lists to datafiles

PARSER
The parser constructs tables of the following form: a table t is a list of pairs (α, β), where α is a (possibly empty) string over VN ∪ VT and β is a list of rule-indices (i.e. a derivation history, empty at initialization).

Fig. 1.
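The grammar properties on which the parser relies (ε-freeness, left-recursion, cycles) can be checked mechanically. The following Python sketch shows one way to do so; it is not a reconstruction of the Gramtest procedures. The function names, the rule representation as `(left side, right side)` tuples and the two toy grammars are assumptions made for this example. The cycle test relies on the fact that in an ε-free grammar a derivation A → ... → A can only consist of rules with a single nonterminal on the right side.

```python
def is_epsilon_free(rules):
    """True if no rule has an empty right side."""
    return all(rhs for _, rhs in rules)


def _self_reaching(rules, keep):
    """Nonterminals that reach themselves via the relation
    'A -> B...' restricted to rules whose right side satisfies `keep`."""
    nts = {lhs for lhs, _ in rules}
    reach = {a: set() for a in nts}
    for lhs, rhs in rules:
        if rhs and keep(rhs) and rhs[0] in nts:
            reach[lhs].add(rhs[0])
    changed = True
    while changed:  # transitive closure by fixpoint iteration
        changed = False
        for a in nts:
            for b in list(reach[a]):
                if not reach[b] <= reach[a]:
                    reach[a] |= reach[b]
                    changed = True
    return {a for a in nts if a in reach[a]}


def left_recursive_symbols(rules):
    """Nonterminals A with a derivation A -> ... -> A gamma (epsilon-free grammar assumed)."""
    return _self_reaching(rules, lambda rhs: True)


def cyclic_symbols(rules):
    """Nonterminals A with a derivation A -> ... -> A (epsilon-free grammar assumed)."""
    return _self_reaching(rules, lambda rhs: len(rhs) == 1)


# Hypothetical toy grammars
G1 = [("E", ("E", "+", "a")), ("E", ("a",))]          # left-recursive, cycle-free
G2 = [("A", ("B",)), ("B", ("A",)), ("B", ("b",))]    # contains the cycle A -> B -> A

print(is_epsilon_free(G1), left_recursive_symbols(G1), cyclic_symbols(G1))
print(is_epsilon_free(G2), sorted(cyclic_symbols(G2)))
```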

Input: read grammar G and its T-lists from file
While the program is not terminated by the user:
    Input: read input string ST
    String test: are all symbols of ST in VT?
    yes:
        n := length of ST (* number of symbols *)
        i := 1 (* index of the input symbol which is to be found actually *)
        j := 1 (* table-index *)
        Initialization: Construct table t1 including the only pair (S, -), where S is the starting symbol of G
        While i < n+1 and tj is not empty:
            While there are pairs (Aα, β) in tj, where A is a nonterminal symbol:
                j := j+1
                Expansion: For each pair (Aα, β) found in tj-1 do: find the rules r of G which have the form A → γ; if the symbol ai of ST is in the T-list of r (T-list-check), add (γα, βm) to tj, where m is the rule-index of r
                Elimination: Eliminate from tj all pairs (α, β), where 1) the length l of α is greater than n-i+1 (length restriction) or 2) α begins with a terminal symbol other than ai
            Input-Symbol-Check: j := j+1; for all pairs (aiδ, β) in tj-1 add (δ, β) to tj (also if δ is empty)
            i := i+1
        i = n+1?
            yes: Output: Write "String accepted"; write the history of the left-derivations (i.e. the β's in the pairs of tj); construct the parse trees (tree-procedures)
            no: Output: Write "String not accepted at position i-1"
(* In the case of 'string accepted' the α's in all pairs of tj are empty because of the length-restrictions *)

Fig. 2.
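The breadth-first procedure of Fig. 2 can be sketched in Python as follows. This is one reading of the figure, not the original U.C.S.D.-PASCAL program, and it rests on several assumptions: the rules of Table 1 are numbered from 1 in the order listed; pairs that already begin with the current input symbol are carried over unchanged during an expansion step; only the current table is kept; and a string counts as accepted only if the final table is nonempty. The names `RULES`, `t_lists` and `parse` are invented for this sketch.

```python
GRAMMAR = """
ADJG -> ADJ
ADJG -> ADJ ADJG
ADVB -> ADV
ADVB -> PP
APP -> ADJG N
APP -> ADJG N PP
APP -> N
APP -> N PP
DET -> ART
DET -> NUM
NP -> DET APP
NP -> NOMP
NP -> PRON
NS -> KONJ NP V
NS -> KONJ NP VG1 V
PP -> PRAE NP
S -> HV NP PART
S -> HV NP PART VG2
S -> HV NP VG1 PART
S -> HV NP VG1 PART VG2
S -> V NP
S -> V NP VG1
S -> V NP VG2
VG1 -> ADVB
VG1 -> NP
VG1 -> NP ADVB
VG2 -> NS
"""
# Rules as (left side, right side); the rule-index is the 1-based position (an assumption).
RULES = [(l.strip(), tuple(r.split()))
         for l, r in (line.split("->") for line in GRAMMAR.strip().splitlines())]


def t_lists(rules):
    """T-list of each rule: terminals that can begin a string derived from its
    right side. In an epsilon-free grammar only the first symbol matters."""
    nts = {lhs for lhs, _ in rules}
    first = {a: set() for a in nts}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            new = first[rhs[0]] if rhs[0] in nts else {rhs[0]}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return [first[rhs[0]] if rhs[0] in nts else {rhs[0]} for _, rhs in rules]


def parse(rules, start, tokens):
    """Return all left-derivations of `tokens` as tuples of rule-indices."""
    nts = {lhs for lhs, _ in rules}
    tl = t_lists(rules)
    n = len(tokens)
    table = [((start,), ())]                       # initialization: the pair (S, -)
    for i, a in enumerate(tokens, start=1):
        while any(alpha and alpha[0] in nts for alpha, _ in table):
            expanded = []
            for alpha, beta in table:              # expansion with T-list-check
                if alpha and alpha[0] in nts:
                    for m, (lhs, rhs) in enumerate(rules, start=1):
                        if lhs == alpha[0] and a in tl[m - 1]:
                            expanded.append((rhs + alpha[1:], beta + (m,)))
                else:
                    expanded.append((alpha, beta))  # assumption: carried over
            table = [(alpha, beta) for alpha, beta in expanded   # elimination
                     if len(alpha) <= n - i + 1
                     and not (alpha and alpha[0] not in nts and alpha[0] != a)]
        # input-symbol-check: consume the current input symbol
        table = [(alpha[1:], beta) for alpha, beta in table if alpha and alpha[0] == a]
        if not table:
            return []
    return [beta for alpha, beta in table if not alpha]


for derivation in parse(RULES, "S", "HV NOMP PART KONJ PRON ART N PRAE ART N V".split()):
    print(derivation)
```

For the input string of section 4 the sketch yields two left-derivations, one for each reading of the example sentence. The length restriction is what keeps the inner loop finite for left-recursive rules: since the grammar is ε-free, every expansion of a left-recursive rule lengthens α, and pairs longer than the remaining input are discarded.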

and can, with slight modifications, be executed on the large computer of our university.

4. Example

An outline of the general top-down-parser written in U.C.S.D.-PASCAL is given in Fig. 1 and 2, and an example of parsing with it in Fig. 3. The grammar generates a subset of German yes-no-question structures (see Table 1). The underlined symbols are the terminal symbols. S is the starting symbol. The symbols stand for the following grammatical categories: ADJ: adjective; ADJG: group of adjectives; ADV: adverb; ADVB: adverbial phrase; APP: common noun phrase (Appellativum); ART: article; DET: determiner; HV: auxiliary verb; KONJ: conjunction; N: noun; NOMP: proper noun; NP: nominal phrase; NS: subordinate clause (Nebensatz); NUM: numeral; PART: participle; PP: prepositional phrase; PRAE: preposition; PRON: pronoun; S: sentence (here: interrogative sentence); V: verb; VG1: complement of the verb (objects, adverbial phrases); VG2: complement of the verb (subordinate clauses).

Input String: HV NOMP PART KONJ PRON ART N PRAE ART N V

This corresponds to a question such as: Hat Peter gesagt, dass er den Mann mit dem Fernglas sah?

Readings:
(1) Did Peter say that he saw the man who had the binoculars?
(2) Did Peter say that he saw the man through the binoculars?

Output: Trees in diagram form or in parenthesized form. The diagrams have the format shown in Fig. 3.
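How a left-derivation given as a list of rule-indices determines a tree in parenthesized form can be sketched in a few lines of Python. The sketch is an illustration only and not the tree-procedures of the program: the rule-indices assume that the rules of Table 1 are numbered from 1 in the order listed, only the rules needed for reading (1) are included, and the name `build_tree` and the output notation are invented here.

```python
# Subset of the rules of Table 1, keyed by an assumed 1-based rule-index.
RULES = {
    7: ("APP", ("N",)),
    8: ("APP", ("N", "PP")),
    9: ("DET", ("ART",)),
    11: ("NP", ("DET", "APP")),
    12: ("NP", ("NOMP",)),
    13: ("NP", ("PRON",)),
    15: ("NS", ("KONJ", "NP", "VG1", "V")),
    16: ("PP", ("PRAE", "NP")),
    18: ("S", ("HV", "NP", "PART", "VG2")),
    25: ("VG1", ("NP",)),
    27: ("VG2", ("NS",)),
}
NONTERMINALS = {lhs for lhs, _ in RULES.values()}


def build_tree(symbol, derivation, words):
    """Consume rule-indices (and words) in left-derivation order and return
    the subtree for `symbol` in parenthesized form."""
    if symbol not in NONTERMINALS:
        return f"{symbol} {next(words)}"       # terminal symbol with its word
    lhs, rhs = RULES[next(derivation)]
    assert lhs == symbol, "derivation does not match the grammar"
    children = [build_tree(s, derivation, words) for s in rhs]
    return f"{symbol}({', '.join(children)})"


# Left-derivation for reading (1): the prepositional phrase belongs to the noun phrase.
derivation = [18, 12, 27, 15, 13, 25, 11, 9, 8, 16, 11, 9, 7]
words = "Hat Peter gesagt daß er den Mann mit dem Fernglas sah".split()
tree = build_tree("S", iter(derivation), iter(words))
print(tree)
```

The recursion works because a left-derivation lists the rules in the same order in which a left-to-right, top-down traversal of the tree visits the nonterminal nodes.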

Table 1

Rules                          T-lists
ADJG → ADJ                     ADJ
ADJG → ADJ ADJG                ADJ
ADVB → ADV                     ADV
ADVB → PP                      PRAE
APP → ADJG N                   ADJ
APP → ADJG N PP                ADJ
APP → N                        N
APP → N PP                     N
DET → ART                      ART
DET → NUM                      NUM
NP → DET APP                   ART NUM
NP → NOMP                      NOMP
NP → PRON                      PRON
NS → KONJ NP V                 KONJ
NS → KONJ NP VG1 V             KONJ
PP → PRAE NP                   PRAE
S → HV NP PART                 HV
S → HV NP PART VG2             HV
S → HV NP VG1 PART             HV
S → HV NP VG1 PART VG2         HV
S → V NP                       V
S → V NP VG1                   V
S → V NP VG2                   V
VG1 → ADVB                     ADV PRAE
VG1 → NP                       ART NOMP NUM PRON
VG1 → NP ADVB                  ART NOMP NUM PRON
VG2 → NS                       KONJ

Terminal symbols (underlined in the original): ADJ, ADV, ART, HV, KONJ, N, NOMP, NUM, PART, PRAE, PRON, V.

1. reading

S
  HV Hat
  NP
    NOMP Peter
  PART gesagt
  VG2
    NS
      KONJ daß
      NP
        PRON er
      VG1
        NP
          DET
            ART den
          APP
            N Mann
            PP
              PRAE mit
              NP
                DET
                  ART dem
                APP
                  N Fernglas
      V sah

2. reading

S
  HV Hat
  NP
    NOMP Peter
  PART gesagt
  VG2
    NS
      KONJ daß
      NP
        PRON er
      VG1
        NP
          DET
            ART den
          APP
            N Mann
        ADVB
          PP
            PRAE mit
            NP
              DET
                ART dem
              APP
                N Fernglas
      V sah

Fig. 3.

References

[1] A.V. Aho, J.D. Ullman: The Theory of Parsing, Translation and Compiling, Vol. 1: Parsing. 1972.
[2] R. Dietrich, W. Klein: Computerlinguistik. Eine Einführung. 1974.
[3] U. Klenk, J. Mau: Kontextfreie Syntaxanalyse mit einem Mikrocomputer. In: Sprache und Datenverarbeitung 3 (1979).

U. Klenk was born in Dresden (Germany) in 1943. She studied Romance Philology and Semitic languages and received the Dr. phil. degree from the University of Göttingen. Since 1972 she has been Akademischer Rat in the Department of Romance Philology, University of Göttingen. During the last years her activities were related to mathematical linguistics, natural language processing and artificial intelligence. Her current research interests are in the field of syntactic analysis and procedural semantics.