North-Holland Publishing Company, Microprocessing and Microprogramming 9 (1982) 281-284
Microcomputers in Linguistic Data Processing: Context-Free Parsing

Ursula Klenk
Seminar für Romanische Philologie der Universität Göttingen, Nikolausberger Weg 23, Göttingen, F.R. Germany

Progress in the development of microcomputers makes it possible to use them for the processing of natural languages, in particular for syntactic analysis and semantics. Microcomputers also prove to be a reasonable medium for teaching linguistic data processing in university courses. This paper reports on demonstration programs for context-free parsing which were written for a microcomputer. The programming languages are BASIC and PASCAL. An illustrative example is given.
Keywords: Natural language processing, syntactic analysis, context-free parsing, microcomputers, teaching, demonstration programs.
1. Introduction
In the university study of natural languages, the field of linguistic data processing is becoming more and more important. In dealing with problems of automatic translation or of artificial-intelligence systems for the processing of natural languages, one always has to solve problems of syntactic analysis. Many parsing algorithms have been proposed in the investigation of context-free grammars. Because context-free grammars play an important part in the syntactic description of natural languages, the teacher of linguistic data processing has the task of showing students how these algorithms work and how they can be transformed into computer programs. For our linguistics classes at the University of Göttingen, J. Mau and the author of this paper programmed several well-known parsing algorithms in such a manner that they run on a microcomputer, so that they can be used as demonstration and teaching material. First of all, it seemed important to show the principles of parsing (top-down, bottom-up, tabular methods).
We therefore wrote programs which do this, but most of which contain no further procedures for increasing the efficiency of parsing. Such procedures could, however, be inserted. Besides serving as demonstration material, the parsers serve as a basis for further investigations, as outlined in Section 3.
2. The Parsers
Four programs run on a TRS-80 Model I (48 kbyte RAM, 3 mini-floppy-disks). The input is always a context-free grammar and a string (the input string) which is to be analysed. The output states whether or not the string is generated by the grammar. A short survey follows (the reader is assumed to be familiar with context-free parsing):

1) Method: general top-down parser (without backtracking) working from left to right which, in executing an expansion, checks all alternatives for a given left side of a rule. Output: if the input string is generated by the grammar, all possible parse trees for the string are given. Programming language: U.C.S.D.-PASCAL. Remarks: The problem of left-recursion is solved by a special device. The grammar must be ε-free and cycle-free.

2) Method: top-down parser with backtracking [1]. Output: if the input string is generated, the first left-derivation found. Programming language: a BASIC dialect. Remarks: There is a procedure which shortens the parsing process by a sort of lookahead at the symbol of the input string which is to be found at the current stage of the analysis [3]. The problem of left-recursion is solved by a special device. The grammar must be ε-free and cycle-free.
3) Method: general bottom-up parser executing all possible reductions [2]. Output: all reduction results. If the input string is generated by the grammar, the reductions lead to one or more strings consisting of only one occurrence of the starting symbol of the grammar. Programming language: U.C.S.D.-PASCAL. Remarks: The grammar must be ε-free and cycle-free. The algorithm is very slow and serves as a good example of a "bad" algorithm in general. It may, however, be applied advantageously to special grammars which have only branching rules (rules with more than one symbol on the right side).

4) Method: Earley algorithm [1]. Output: the parse tables for the input string, if it is generated by the grammar. Programming language: U.C.S.D.-PASCAL. Remarks: The parser works relatively fast.
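The exhaustive reduction strategy of parser 3) can be illustrated with a minimal sketch. It is written in Python rather than in the PASCAL of the original program, and the function name `reduce_all`, the data layout and the toy grammar (S → a S b, S → a b) are assumptions made only for this illustration, not the author's implementation.

```python
from collections import deque

# Toy grammar (an assumption for illustration): S -> a S b | a b
RULES = [("S", ("a", "S", "b")), ("S", ("a", "b"))]


def reduce_all(rules, start, tokens):
    """Apply every possible reduction, breadth-first.

    Returns the set of all strings reached and a flag telling whether
    the string consisting only of the starting symbol was among them.
    The grammar is assumed to be epsilon-free, so no reduction makes
    a string longer and the search space is finite.
    """
    initial = tuple(tokens)
    seen = {initial}
    queue = deque([initial])
    while queue:
        form = queue.popleft()
        for lhs, rhs in rules:
            k = len(rhs)
            # Try the right side of the rule at every position of the string.
            for pos in range(len(form) - k + 1):
                if form[pos:pos + k] == rhs:
                    reduced = form[:pos] + (lhs,) + form[pos + k:]
                    if reduced not in seen:
                        seen.add(reduced)
                        queue.append(reduced)
    return seen, (start,) in seen


forms, accepted = reduce_all(RULES, "S", "a a b b".split())
print(accepted, len(forms))
```

Because every reduction at every position is tried, the number of intermediate strings can grow very quickly with the length of the input, which is consistent with the remark that the method is slow.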
Two further programs run on a POLYMORPHIC 88 (16 kbyte RAM). The first is a general top-down parser as in 1) which shows the results of the single parsing steps, but which does not construct the parse trees. The second simulates a nondeterministic finite automaton. The programming language is a BASIC dialect. For the sake of simplicity of the programs, the terminal and nonterminal symbols of the grammar and the input symbols and state symbols of the automaton have to be encoded in a special manner.

3. Applications

The programs mentioned can be used to study the parsing properties of context-free grammars used in the field of natural language processing. We use some of the parsers as a basis for further linguistic analysis concerning transformational grammars and procedural semantics. In this respect it is important that the programs written in PASCAL are portable to other machines
TOPDOWN-PARSER (breadth-first, working from left to right)

Input:
1. An ε-free and cycle-free context-free grammar G = (VN, VT, R, S), where VN is the set of nonterminal symbols, VT the set of terminal symbols, R the set of rules and S the starting symbol.
2. An input string ST = a1...an, where ai ∈ VT (1 ≤ i ≤ n).

Output: Information on whether ST is accepted by G or not. If it is accepted, all left-derivations of ST (in the form of rule-indices) and all possible parse-trees for ST.

Parts of the program system:
1. Input and preparation of the grammar
2. PARSER
3. Procedures for the output of the history of parsing and of the parse-trees

GRAMMAR-PREPARATION
Input of the grammar G (rule-format: A → α).
Gramtest1: checks 1) whether G is ε-free, 2) whether there are cycles (i.e. derivations of the form A → ... → A).
T-List-Construction ('terminal symbol list'): for each rule r ∈ R it gives the T-list T, where T contains all terminal symbols x which can be generated as leftmost symbols by the right hand side of r.
Gramtest2: 1) If G is left-recursive, it indicates the rules involved (left-recursiveness can be handled by the parser, but may result in long parsing-processes). 2) For certain rules of the form A → ... and C → ... (where B ∈ VN) [...] Gramtest2 gives a warning.
Output of the grammar and the T-lists to datafiles.

PARSER
The parser constructs tables of the following form: a table t is a list of pairs (α, β), where α is a (possibly empty) string over VN ∪ VT and β a list of rule-indices (i.e. a derivation history, empty at initialization).

Fig. 1.
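The T-list construction described in Fig. 1 can be sketched as follows. Because the grammar is required to be ε-free, the terminal symbols that can appear leftmost for a rule depend only on the first symbol of its right hand side. The sketch is in Python; the function name `t_lists` and the data layout are assumptions for illustration, and the rules are an excerpt of the example grammar of Table 1.

```python
# Excerpt of the example grammar (rule = (left side, right side)).
RULES = [
    ("DET", ("ART",)),
    ("DET", ("NUM",)),
    ("NP", ("DET", "APP")),
    ("NP", ("NOMP",)),
    ("NP", ("PRON",)),
    ("APP", ("N",)),
    ("VG1", ("NP",)),
]
TERMINALS = {"ART", "NUM", "NOMP", "PRON", "N"}


def t_lists(rules, terminals):
    """Return one T-list (sorted list of terminals) per rule, in rule order.

    Assumes an epsilon-free grammar in which every nonterminal used on a
    right side also occurs as the left side of some rule.
    """
    # leftmost[A] = terminals that can begin a string derived from A
    leftmost = {lhs: set() for lhs, _ in rules}
    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for lhs, rhs in rules:
            head = rhs[0]
            new = {head} if head in terminals else leftmost[head]
            if not new <= leftmost[lhs]:
                leftmost[lhs] |= new
                changed = True
    return [
        sorted({rhs[0]} if rhs[0] in terminals else leftmost[rhs[0]])
        for _, rhs in rules
    ]


lists = t_lists(RULES, TERMINALS)
for (lhs, rhs), t in zip(RULES, lists):
    print(lhs, "->", " ".join(rhs), "| T-list:", " ".join(t))
```

For the rule VG1 → NP this yields the T-list ART NOMP NUM PRON, which agrees with the corresponding entry of Table 1.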
Input: read grammar G and its T-lists from file

While the program is not terminated by the user:
    Input: read input string ST
    String test: are all symbols of ST in VT?
    yes:
        n := length of ST;   (* number of symbols *)
        i := 1;              (* index of the input symbol which is to be found at present *)
        j := 1               (* table-index *)
        Initialization: Construct table tj including the only pair (S, -), where S is the starting symbol of G
        While i < n+1 and tj is not empty:
            While there are pairs (Aα, β) in tj, where A is a nonterminal symbol:
                j := j+1
                Expansion: For each pair (Aα, β) in tj-1 find the rules r of G which have the form A → γ.
                    Input-Symbol-Check: if the symbol ai of ST is in the T-list of r (T-list-check), add (γα, βm) to tj as the last pair, where m is the rule-index of r.
                Elimination: Eliminate from tj all pairs (α, β), where
                    1) the length l of α is greater than n-i+1 (length restriction), or
                    2) α begins with a terminal symbol other than ai.
            j := j+1: For all pairs (aiδ, β) in tj-1 add (δ, β) to tj (also if δ is empty)
            i := i+1
        i = n+1?
        yes: Output: Write "String accepted"; write the history of the left-derivations (i.e. the β's in the pairs of tj); construct the parse trees (tree-procedures).
             (* In the case of 'string accepted' the α's in all pairs of tj are empty because of the length-restrictions *)
        no:  Output: Write "String not accepted at position i-1"

Fig. 2.
and can, with slight modifications, be executed on the mainframe computer of our university.
4. Example

An outline of the general top-down parser written in U.C.S.D.-PASCAL is given in Figs. 1 and 2, and an example of parsing with it in Fig. 3. The grammar generates a subset of German yes-no-question structures (see Table 1). The underlined symbols are the terminal symbols. S is the starting symbol. The symbols stand for the following grammatical categories: ADJ: adjective; ADJG: group of adjectives; ADV: adverb; ADVB: adverbial phrase; APP: common noun phrase (Appellativum); ART: article; DET: determiner; HV: auxiliary verb; KONJ: conjunction; N: noun; NOMP: proper noun; NP: nominal phrase; NS: subordinate clause (Nebensatz); NUM: numeral; PART: participle; PP: prepositional phrase; PRAE: preposition; PRON: pronoun; S: sentence (here: interrogative sentence); V: verb; VG1: complement of the verb (objects, adverbial phrases); VG2: complement of the verb (subordinate clauses).

Input string: HV NOMP PART KONJ PRON ART N PRAE ART N V, corresponding to a question such as: Hat Peter gesagt, dass er den Mann mit dem Fernglas sah?

Readings:
(1) Did Peter say that he saw the man who had the binoculars?
(2) Did Peter say that he saw the man through the binoculars?

Output: Trees in diagram form or in parenthesized form. The diagrams have the format shown in Fig. 3.
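The breadth-first top-down procedure of Fig. 2 can be tried on this example with the following minimal Python sketch. It is not the author's PASCAL program: the function name `parse` is hypothetical, the rule indices are simply assumed to be the positions of the rules in the list, and the T-list check is left out, so that wrong expansions are only removed by the elimination step. This changes the amount of work, but for an ε-free and cycle-free grammar it should not change the set of derivations found.

```python
# Rules of the example grammar; list position + 1 serves as rule index
# (an assumption, the original rule numbering is not given).
RULES = [
    ("ADJG", "ADJ"), ("ADJG", "ADJ ADJG"), ("ADVB", "ADV"), ("ADVB", "PP"),
    ("APP", "ADJG N"), ("APP", "ADJG N PP"), ("APP", "N"), ("APP", "N PP"),
    ("DET", "ART"), ("DET", "NUM"),
    ("NP", "DET APP"), ("NP", "NOMP"), ("NP", "PRON"),
    ("NS", "KONJ NP V"), ("NS", "KONJ NP VG1 V"), ("PP", "PRAE NP"),
    ("S", "HV NP PART"), ("S", "HV NP PART VG2"), ("S", "HV NP VG1 PART"),
    ("S", "HV NP VG1 PART VG2"), ("S", "V NP"), ("S", "V NP VG1"),
    ("S", "V NP VG2"),
    ("VG1", "ADVB"), ("VG1", "NP"), ("VG1", "NP ADVB"), ("VG2", "NS"),
]
RULES = [(lhs, tuple(rhs.split())) for lhs, rhs in RULES]
NONTERMINALS = {lhs for lhs, _ in RULES}


def parse(tokens, start="S"):
    """Return the left-derivations of tokens as tuples of rule indices."""
    n = len(tokens)
    table = [((start,), ())]  # pairs (alpha, beta); beta = derivation history
    for i, symbol in enumerate(tokens):  # i is 0-based here
        remaining = n - i  # corresponds to n-i+1 with a 1-based index
        ready = []  # pairs whose alpha begins with the current input symbol
        while table:
            expanded = []
            for alpha, beta in table:
                # Length restriction: every symbol yields at least one terminal.
                if not alpha or len(alpha) > remaining:
                    continue
                head = alpha[0]
                if head in NONTERMINALS:
                    # Expansion: try all alternatives for the left side.
                    for m, (lhs, rhs) in enumerate(RULES, start=1):
                        if lhs == head:
                            expanded.append((rhs + alpha[1:], beta + (m,)))
                elif head == symbol:
                    ready.append((alpha, beta))
                # Pairs beginning with another terminal are eliminated.
            table = expanded
        # Step over the matched input symbol.
        table = [(alpha[1:], beta) for alpha, beta in ready]
    return [beta for alpha, beta in table if not alpha]


readings = parse("HV NOMP PART KONJ PRON ART N PRAE ART N V".split())
for history in readings:
    print(history)
```

Under these assumptions the sketch finds two left-derivations for the input string, one for each of the two readings given above. The length restriction is also what keeps the expansion loop finite for left-recursive rules, since a form longer than the remaining input is discarded.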
Table 1

Rules                         T-lists
ADJG → ADJ                    ADJ
ADJG → ADJ ADJG               ADJ
ADVB → ADV                    ADV
ADVB → PP                     PRAE
APP → ADJG N                  ADJ
APP → ADJG N PP               ADJ
APP → N                       N
APP → N PP                    N
DET → ART                     ART
DET → NUM                     NUM
NP → DET APP                  ART NUM
NP → NOMP                     NOMP
NP → PRON                     PRON
NS → KONJ NP V                KONJ
NS → KONJ NP VG1 V            KONJ
PP → PRAE NP                  PRAE
S → HV NP PART                HV
S → HV NP PART VG2            HV
S → HV NP VG1 PART            HV
S → HV NP VG1 PART VG2        HV
S → V NP                      V
S → V NP VG1                  V
S → V NP VG2                  V
VG1 → ADVB                    ADV PRAE
VG1 → NP                      ART NOMP NUM PRON
VG1 → NP ADVB                 ART NOMP NUM PRON
VG2 → NS                      KONJ

Terminal symbols: ADJ, ADV, ART, HV, KONJ, N, NOMP, NUM, PART, PRAE, PRON, V.

1. reading

S
  HV Hat
  NP
    NOMP Peter
  PART gesagt
  VG2
    NS
      KONJ daß
      NP
        PRON er
      VG1
        NP
          DET
            ART den
          APP
            N Mann
            PP
              PRAE mit
              NP
                DET
                  ART dem
                APP
                  N Fernglas
      V sah

2. reading

S
  HV Hat
  NP
    NOMP Peter
  PART gesagt
  VG2
    NS
      KONJ daß
      NP
        PRON er
      VG1
        NP
          DET
            ART den
          APP
            N Mann
        ADVB
          PP
            PRAE mit
            NP
              DET
                ART dem
              APP
                N Fernglas
      V sah

Fig. 3.

References
[1] A.V. Aho, J.D. Ullman: The Theory of Parsing, Translation and Compiling, Vol. 1: Parsing. 1972.
[2] R. Dietrich, W. Klein: Computerlinguistik. Eine Einführung. 1974.
[3] U. Klenk, J. Mau: Kontextfreie Syntaxanalyse mit einem Mikrocomputer. In: Sprache und Datenverarbeitung 3 (1979).

U. Klenk was born in Dresden (Germany) in 1943. She studied Romance Philology and Semitic languages and received the Dr. phil. degree from the University of Göttingen. Since 1972 she has been Akademischer Rat in the Department of Romance Philology, University of Göttingen. In recent years her activities have been related to mathematical linguistics, natural language processing and artificial intelligence. Her current research interests are in the field of syntactic analysis and procedural semantics.