A linguistic approach to information retrieval—II†


PETR SGALL, JARMILA PANEVOVÁ and SVATAVA MACHOVÁ
Centre of Numerical Mathematics, Charles University, Praha 1, Malostranské n. 25, Czechoslovakia
(Received 6 June 1975)

Abstract-In this part of the study we characterize a framework for the synthesis of the answers delivered by the brain, i.e. for the transduction from the language of the brain into some natural language (cf. point (ii) from the beginning of Part I). We work here with the tectogrammatical language as the language of disambiguated information (see Part I) and with Czech as the target language. We look for a way of formulating algorithms that are also useful for the synthesis of Czech in machine translation, where the point of departure for the synthesis can likewise be more or less identical with the semantic (tectogrammatical) representation. The method of synthesis described in this part is now experimentally verified and tested on an IBM 370 computer.

The tectogrammatical representation of a sentence is translated (transduced) to the lower levels of the language system, down to the graphemic one. An experimental version of generative rules has been used to derive the semantic representations automatically. The semantic representations generated by our experimental generative component differ slightly from those characterized in Part I, but they are based on the same theoretical principles (the dependency approach, the distinction between a “structural” and a “linear” ordering, etc.). The generative component has the shape of a context-free phrase structure grammar; it is simplified from the empirical point of view in that the variation in communicative dynamism (the various types of topic/comment articulation of a sentence) is neglected and only one “deep word-order” is generated for each type of construction. Complex non-terminal symbols of the grammar are ordered n-tuples (X1, X2, ..., Xn), where X1 is the so-called name-symbol, i.e. a name shared by a certain class of non-terminal symbols, and X2, ..., Xn are indices specifying the individual non-terminal symbols of that class. A terminal symbol has a similar structure. The (complex) terminal symbols correspond to so-called lexemes, indices, grammatemes and functors. One name-symbol corresponds to the left bracket and one to the right one. The number of indices actually used differs with the individual name-symbols; the brackets and some other name-symbols have no indices at all. In the generative component there are several types of context-free rules: modifying, substitutional and selectional ones. They are illustrated in Fig. 1, where U, V, W are auxiliary non-terminal name-symbols, u is a terminal name-symbol, and r is some functor, i.e. a terminal symbol indicating (i) which of the two non-terminal symbols on the right-hand side of the rule is the governing one and which is the dependent one, and (ii) the type of dependency (see below).
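The shape of the complex symbols and of a modifying rule can be pictured with a small data model. The following is only a minimal sketch under assumptions: the class names, the functor label "agent" and the index values are hypothetical and are not taken from the experimental grammar; the arity check reflects the statement that a functor chooses between two non-terminal symbols on the right-hand side.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Symbol:
    """Complex symbol (X1, X2, ..., Xn): a name-symbol plus its indices."""
    name: str                       # X1, the name-symbol
    indices: Tuple[int, ...] = ()   # X2 ... Xn (values here are made up)
    terminal: bool = False


@dataclass(frozen=True)
class Functor:
    """Terminal symbol naming the dependency type and the governing side."""
    relation: str   # hypothetical label, e.g. "agent"
    governor: str   # "left" or "right"


@dataclass(frozen=True)
class Rule:
    lhs: Symbol
    rhs: Tuple[Union[Symbol, Functor], ...]

    def __post_init__(self) -> None:
        # Assumption: no rule has more than two non-terminals on its right side.
        nonterminals = [s for s in self.rhs
                        if isinstance(s, Symbol) and not s.terminal]
        if len(nonterminals) > 2:
            raise ValueError("at most two non-terminals on the right-hand side")

    def governing(self) -> Symbol:
        """Return the symbol the functor marks as governing.

        Assumes a modifying rule with exactly two symbols and one functor.
        """
        functor = next(s for s in self.rhs if isinstance(s, Functor))
        left, right = [s for s in self.rhs if isinstance(s, Symbol)]
        return left if functor.governor == "left" else right


# A modifying rule of the shape U -> V r W, with V marked as governing.
U = Symbol("U", (1,))
V = Symbol("V", (2, 0))
W = Symbol("W")
modifying = Rule(U, (V, Functor("agent", "left"), W))
```

Here `modifying.governing()` returns `V`, and constructing a rule with three non-terminals on the right-hand side raises `ValueError`.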
† This is the second part of the study “A Linguistic Approach to Information Retrieval”, written by P. Sgall and colleagues. Part I was published in Information Storage and Retrieval, Vol. 10, pp. 411-417 (Issue Nos. 11/12, November/December 1974).

The rules in the generative component are subject to a certain restriction: their right-hand side does not contain more than two non-terminal symbols, so that we work with a binary context-free grammar. The generative component contains recursive rules; the order in which the rules are applied is determined by the form of the rules themselves (by the selection of non-terminal symbols). The rules are not divided into obligatory and optional ones; whether it is necessary or only possible to use some rule in generating the semantic representation of a sentence follows from the form of the rules themselves. To keep the representation brief and legible, rule-schemes are widely used. The syntactic relations between complex symbols (each of which comprises a lexical unit together with its indices or grammatemes) are accounted for by means of functors denoted by R with a subscript. In the experimental version of the description six such functors are

[Fig. 1. Types of context-free rules in the generative component: modifying, substitutional and selectional rules.]

distinguished: one for the relation of the agent to the verb, one for that of the patient, one for the indirect object (dative), one for the so-called second object, one for the free (adverbial) expansions, and one for a zero relation (as e.g. with the vocative case). In a more recent version of the tectogrammatical representation the functors are differentiated to a greater extent: the adverbial relationships, formerly denoted only by different grammatemes of the shape det (with a subscript) accompanying the lexical symbol connected with the verb by the functor for free expansions, are now denoted by different functors. The grammatemes of the shape det now account only for semantic variation inside a given adverbial (e.g. a noun connected with its governing verb by one of these functors can be accompanied by one of several det grammatemes; the functors for the temporal adverbial can be combined with a semantic variation comprising simultaneity, precedence and subsequence). The expansions of the verb are divided into inner and free ones; the former are either obligatory or optional. On the basis of whether a verb has a given expansion as an inner one (included in its verbal frame, in Fillmorean terms), whether this expansion is obligatory or optional, and to what type of expansion (with what functor) it belongs, we distinguish, in the experimental version, 15 classes of Czech verbs. The classes are mutually disjoint (so that a partition of the class of all described verbs is defined), and the membership of a verb in a class is denoted by the value of one of the indices accompanying the lexical symbol of the verb. Some syntactic pairs are generated, in this version, without regard to selection restrictions (e.g. the verb and most of its free expansions, or the noun and its adjunct); in other cases such restrictions are respected (e.g. the verb and its inner expansions or, in some cases, an adjective and an adverbial dependent on it).
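The classification of verbs by their inner expansions can be illustrated with a toy lexicon. This is a sketch under stated assumptions, not the experimental dictionary: the two frame classes, the English verbs and the relation labels are all hypothetical (the paper distinguishes 15 classes of Czech verbs, which are not listed in the text). Each verb carries one class index, so the classes partition the lexicon; free expansions are not checked.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class VerbFrame:
    """Inner expansions of a verb class, split into obligatory and optional."""
    obligatory: FrozenSet[str]
    optional: FrozenSet[str] = frozenset()


# Hypothetical frame classes, keyed by the class index stored with the verb.
FRAMES: Dict[int, VerbFrame] = {
    1: VerbFrame(frozenset({"agent"})),
    2: VerbFrame(frozenset({"agent", "patient"}), frozenset({"dative"})),
}

# Hypothetical lexicon: every verb belongs to exactly one class.
VERB_CLASS: Dict[str, int] = {"sleep": 1, "give": 2}


def inner_expansions_ok(verb: str, inner: Iterable[str]) -> bool:
    """Check the inner expansions of a verb against its frame class."""
    frame = FRAMES[VERB_CLASS[verb]]
    found = set(inner)
    missing = frame.obligatory - found
    unlicensed = found - frame.obligatory - frame.optional
    return not missing and not unlicensed
```

With this toy data, `inner_expansions_ok("give", ["agent", "patient"])` holds, while dropping the obligatory patient or adding a patient to `"sleep"` is rejected.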
Thus a first approximation to a more detailed classification of semantic word classes has been achieved. The membership of a lexical symbol in certain classes, corresponding to its possible occurrence with other lexical symbols in syntactic pairs, is denoted by indices and their values. Thus, the set of semantic representations of sentences generated by this variant of the functional generative description of Czech will not include sequences of symbols that would underlie such non-acceptable syntactic pairs as *the table contemplates or *came during town, since such verbs as contemplate are equipped with an index requiring an animate agent, and the lexical symbol for town is marked as not being able to function as a free temporal adverbial (with any verb). The translation of the semantic representation to the graphemic level, i.e. its transduction to a sequence of Czech word-forms, is realized in several steps. In the first of them, the semantic representation is translated to the phenogrammatical (surface syntactic) level; the functors, corresponding to the relations between a governing and a dependent lexical item at the tectogrammatical level, are changed here into symbols (grammatemes) for sentence parts; in this connection, the choice between active and passive is made, as well as between a dependent clause, an infinitive, a nominalization, etc.; the semantic functions (meanings) are replaced by the means realizing them at the next lower level. The second step consists in the translation of the representation to the morphemic level; the meanings of place, time, cause etc. are changed here into the morphemic forms realizing them (prepositional phrases etc.); furthermore, the morphemic units of tense, aspect, gender, number etc. are chosen here. The rules of morphemic synthesis then translate these morphemic representations into sequences of Czech word forms (with case inflections, personal endings etc.)
corresponding to grammatical sentences; at last, the graphemic shape of a sentence is achieved, which expresses the meaning that was represented by the given input string of the transductive components. The sequence of computer programs performing this transduction is based on the formal pattern of pushdown transducers. The main program of each step is constructed on the basis of


the defining function of such a transducer (see SGALL et al. [1], pp. 40f, 60f, 76ff), where the changes necessary for the transduction to the next lower level are ensured by a single pass through the given representation of a sentence, while every dependent word (rectum) is confronted with its governing word (regens). The results of such a confrontation (first of all, modifications of the dependent word according to relevant properties of its governor) are given in the form of large tables, represented as subroutines of the main program; they are activated whenever the two members of a syntactic pair are confronted, one of them being read at the input of the transducer and the other being at the accessible end of the pushdown store. The large size of the tables makes it necessary to have specific subroutines (a) for the identification of the types of word forms figuring as names of rows, (b) for those figuring as names of lines, and (c) for the identification of the result found in the table (at the crossing of the given line and row), i.e. the value of the function for the given values of its two arguments. These operations represent the most extensive part of the computation (and also place the largest demands on the memory of the computer). The formal mechanism used is based on rather strong linguistic hypotheses: (1) For the choice of means (i.e. units of the next lower level) realizing or expressing a given functional unit (of the higher level), it is enough to confront a pair of complex symbols (word forms) connected by syntactic dependency (with the exception of individual more complex constructions); the dependent item changes its characteristics according to relevant properties of its governing item. The necessary contextual data can be supplied by an auxiliary transducer (automaton I), which, first of all, changes the order of the complex symbols in the semantic representation to an order of the type “regens post rectum”,
so that the confrontation of the members of each syntactic pair is made possible. (2) During the work of a transducer, it is sufficient to pass through the representation of the sentence only once. (3) When the dependent item is being transduced to the next lower level, its governing item has already been transduced to this level. It has been possible to construct algorithms for the synthesis of Czech sentences that respect these assumptions. The formal shape of the rules makes it possible to interpret them as rendering the relationship of “function” (meaning) and “form” (means), well known from the Prague school of structural linguistics (see e.g. VACHEK [2]); such notions as KURYŁOWICZ's [3] primary and secondary functions of a form can also be studied in a more explicit way when using such a formalism. We want to illustrate the functioning of our description by a Czech sentence including only such lexical elements as figure in the dictionary used during the present experiment; at the input of the transductive components there is the semantic representation, the transduction of which leads to the following Czech sentence:

Žena, která u stolu čte, je velká. (The woman that is reading at the table is tall.)

We can also get, at the output, the paraphrase

Žena čtoucí u stolu je velká. (The woman reading at the table is tall.)
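The single-pass confrontation of governing and dependent items described above can be sketched as a stack-based traversal. This is a simplified illustration, not the authors' IBM 370 programs: the node fields, the functor labels ("agent", "attr", "where", "pred") and the small lookup table are assumptions made for the example, and the sketch confronts the dependents with their governor at the moment the governor is read, which glosses over the exact timing stated in hypothesis (3). The tree is first linearized so that every dependent precedes its governor ("regens post rectum"); then one pass with a pushdown store looks up each pair in the table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    lexeme: str
    category: str
    functor: Optional[str] = None          # relation to the governor
    dependents: List["Node"] = field(default_factory=list)
    sentence_part: Optional[str] = None    # filled in by the transducer


def regens_post_rectum(node: Node) -> List[Node]:
    """Linearize the tree so that each dependent precedes its governor."""
    out: List[Node] = []
    for dep in node.dependents:
        out.extend(regens_post_rectum(dep))
    out.append(node)
    return out


# Hypothetical table: (governor category, dependent category, functor)
# -> grammateme for the sentence part of the dependent item.
TABLE: Dict[Tuple[str, str, str], str] = {
    ("CO", "N", "agent"): "sb",
    ("CO", "AD", "pred"): "pn",
    ("N", "V", "attr"): "atr",
    ("V", "N", "where"): "adv",
}


def transduce(root: Node) -> List[Tuple[str, Optional[str]]]:
    """One pass over the linearized tree, using a pushdown store."""
    stack: List[Node] = []
    for item in regens_post_rectum(root):
        # The dependents of the item just read lie on top of the stack.
        split = len(stack) - len(item.dependents)
        for dep in stack[split:]:
            dep.sentence_part = TABLE[(item.category, dep.category, dep.functor)]
        del stack[split:]
        stack.append(item)
    return [(n.lexeme, n.sentence_part) for n in regens_post_rectum(root)]


tree = Node("BYT", "CO", dependents=[
    Node("ZENA", "N", "agent", [
        Node("CIST", "V", "attr", [Node("STUL", "N", "where")]),
    ]),
    Node("VELKY", "AD", "pred"),
])
result = transduce(tree)
```

For this toy tree, `result` pairs each dependent lexeme with the sentence-part symbol found in the table, and leaves the root without one.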

From the tables we are working with in the present experiment, we have chosen to present here only the relevant parts, i.e. those that are necessary for the transduction of our example. The procedure of transduction of the input sequence of symbols, ensured by an application of the main algorithm (corresponding to the functioning of the pushdown transducer), is equivalent to a procedure using the comparison of nodes of a dependency tree characterizing the given sentence. For the sake of clarity, it is possible to use the dependency tree, while the transduction itself (as defined by our algorithms and programs) works with linear representations in which the lines of the tree are replaced by pairs of parentheses. For the ease of the reader, the codes used in the grammar are replaced here, in part, by abbreviations which might be more comprehensible. As can be seen from the example, the tables are divided into several parts; this is necessary, first of all, due to their size, and, secondly, due to the generalization of some rules (which makes it possible to state some rules only once instead of repeating them many times; for instance, the choice of active and passive pertains to every occurrence of a verb, in whatever syntactic position, so that it has been useful to formulate this choice in a specific part of the table, and it is not necessary to repeat the relevant rules for the main verb, for a verb functioning as the agent or the patient of


Table I.

V j=ag.pat,i,n,det ::7(0),6(det) AD:2(q),3(r),5(~),6(y):a(n) co D 1:; a(j) _______ ___________________ _______------___-_----_-~-~-_--_-~----~' --_-________-_t-------V:;14.Dl:;a(j)-99 1V I t i / COLC:;14.AD:2(q),3(r),5(~),6(~);8(n) 2 co

regens

3

N:;14.V::7(0),a(det!

N

V:;Y(akt) CO:;15.N:O(i)-7(j);

rectum

GVA

.._._____

i”J

wart

h:;a(sb)

N:;a(sb) 01:0(i)-7(j); a(sb)-kter6

co(*)

_.__________.

A.

c

3 V:21(0); 2(procees),7(0), a(det).14.D1:; B(w3)-9Y

ALVL:;6(atr) ;a(atr)

v:

0.5 0,5

4 N:;7(q), a(det) q=...,24,... m:;a(ps)

5 AD:;8(n)

Table 2B. any word

_____,_-_-___-______________ . v

V..Y(akt) ..

another verb, i.e. as the head of an embedded sentence, etc.). The passing through the individual parts of the table is determined by the “control algorithm” given in the Appendix. Table 1, characterizing a part of the processing of Transducer I, has an auxiliary character (we do not render here the change of a functor to a grammateme, namely the 8th grammateme of any dependent item; the change of the order of units, mentioned above, is ensured by the defining function of the transducer): here, the necessary adjustments of the governing item according to relevant features of the items dependent on it are made (as far as they represent context data necessary for the transduction of the governing item to the next lower level). Transducer II and Table 2 ensure the proper transduction to the phenogrammatical level (i.e. to the level of surface syntax). Here, the name of the column corresponds to the governing item, the line being assigned to the dependent item, which is modified by this table. Transducer III is of a more complex character, since here in each pair first the governing item is modified according to its dependent items, and then the dependent item is changed according to the features of its governor, so that both members are transduced to the morphemic level; then, the governing word form is again modified (see Table 3B) according to those dependent on it. This second confrontation is necessary since the categories of concord (number, gender) can be transferred from the subject to the verb only after the morphemic shape of the subject lexeme has been chosen (cf. the non-congruence of meaning and form in the number of such nouns, pluralia tantum, as scissors, clippers).
Transducer IV again has an auxiliary character, its main aim being the change of the order of items from the order “regens post rectum” to the surface word order (it corresponds to the order generated by the grammar, which can be interpreted as the scale of communicative dynamism, similar to Chomsky’s range of permissible focus; but the surface word order differs from this scale in some points given by the rules of Transducer IV and concerning the grammatically determined points of word order, which are by far not so numerous in Czech as they are in


Table 3A.

regens

rectum 1

co(*) __--__--_-_-_-_

any word .-_--_-_-____

AD:;8(Nom)

AD:;e(pn)

12(!),13(!j

X:;B(Nom)

2 X:;B(sb) X=N,Dl,NlJM,... 3 X:;7(where 4 B(atr/adv/pnI* 4 ADVA/ADVP:. 7(0),EQatrj

ADVA/ADVP:;B(p), 12(x),13(~)

5 V/CO(*):;7(0); B(atr)

-

Table 3B

N:;B(ab),l2(x),U(q) ____-______---___-_- _-_-_____---_----------------

regens

1 V/CO(*):;lO(inf) 15.Ii:;12(~),13(~! 2 V/CO(%):;lO(inf) (Ffi)-denotes

“not Infinitive”

Table 3C. regens V/CO(*):;e(atr),l5.N:;12(x),13(y) Dl:;lZ(x),l3(y)-kter&

Table 3D. E. F. V/CO:;2(impf),3(99)

D 1: V/CO:;2(process),3

ADVA:;2(99),3(99)-L(P?3)

ADVA:;2(process),3~$~~ter

V/CO:;4(pr+Sz),?(99)

D 3; V/CO:;4(simult), 5{isEided ADVA:;4(x),S(y), x,Y=aW

value

ADVA:;4(93),5(99) V/CO:;6(indik) ADVA:;6(99)

E:

V,co:;6(0) ADVA:;6(0)

F:

X:0(0),7(y) X=N, PRONN, PROW

x:;12(sg),13(y) y = 0 masc.am. 1 masc.nw-

X:;l2(pl),l3(y)

X:0(1,2),7(y)

Table 4. regens rectum co(*):;12(x),13(y) any word _________________-___ ____--_-____-------- -----------------------AD:;12(1),13(y)

AD:;12(!),13(!) V/co(X):;4(nimult/ posterior), YJ;;;) 1 10(0) V,CO:;1(99) root

V/CO:;l,(YY) I full-stop at the end of the sentence

English). Furthermore, Transducer IV assigns the proper position to prepositions, conjunctions, auxiliary and modal verbs, as well as to punctuation marks, not to mention some other minor modifications which make the output string more appropriate for the rules of morphemic synthesis to operate on. These rules (which can be defined as a finite transducer, since the context is relevant only to a very limited extent at this level) then substitute sequences of letters for all the symbols of morphemic units, so that the orthographical form of the sentence is achieved.
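The last step, in which letter sequences are substituted for the symbols of morphemic units, can be pictured as a context-free lookup. The sketch below is an assumption-laden illustration rather than the rules of morphemic synthesis themselves: the lexeme codes, the grammateme labels and the table `FORMS` are hypothetical, and only the resulting Czech word forms come from the example sentence in the text.

```python
from typing import Dict, List, Tuple

Unit = Tuple[str, Tuple[str, ...]]   # (lexeme or punctuation mark, grammatemes)

# Hypothetical form table: one entry per morphemic unit of the example.
FORMS: Dict[Unit, str] = {
    ("ZENA", ("nom", "sg")): "žena",
    ("KTERY", ("nom", "sg", "fem")): "která",
    ("U", ()): "u",
    ("STUL", ("gen", "sg")): "stolu",
    ("CIST", ("pres", "3", "sg")): "čte",
    ("BYT", ("pres", "3", "sg")): "je",
    ("VELKY", ("nom", "sg", "fem")): "velká",
}
PUNCTUATION = {",", "."}


def synthesize(units: List[Unit]) -> str:
    """Replace each morphemic unit by its letters; no wider context is used."""
    out = ""
    for lexeme, tags in units:
        if lexeme in PUNCTUATION:
            out += lexeme                      # attaches to the preceding word
        else:
            out += (" " if out else "") + FORMS[(lexeme, tags)]
    return out[:1].upper() + out[1:]


# Morphemic string in surface order, with punctuation already positioned.
UNITS: List[Unit] = [
    ("ZENA", ("nom", "sg")), (",", ()),
    ("KTERY", ("nom", "sg", "fem")), ("U", ()), ("STUL", ("gen", "sg")),
    ("CIST", ("pres", "3", "sg")), (",", ()),
    ("BYT", ("pres", "3", "sg")), ("VELKY", ("nom", "sg", "fem")), (".", ()),
]
sentence = synthesize(UNITS)
```

The input is assumed to arrive already in surface word order with punctuation marks in place, which is the job the text assigns to Transducer IV.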

Semantic (tectogrammatical) representation (input for 1st automaton, coded for the IBM 370)

# ) > 8 AD VELKEi I. 0. 0, 0. 0. 0. I. M; I. 99. 99. 99, 99. 99. 99. 99 & KNC 8 CO BY/T 0: 99, 0. 1, 0, 1, 99.99, 99 & RA > ) ) 5 V C *T, 1, 10. 1.0. 15, 15,0.7,0. G1,0.0.0,0.0.0.0,0.0. 0,0,0,0, l.3.1.0~1.1,3:99.0.I,0,0,99.0.99&RD8NST0L,0,0,~,0.0.0.0,l:99.99 99, 99, 99, 99. 14, 99 & RA 8 DI : & RDC 8 N Z * EN 0. 0. 0, 0. 0, 0, 0. 2: #

The simplified representation of the sentence (in the shape of dependency tree)

CO: 2(process), 3(noniter), 4(simult), 5(extended), 6(0)
    N ŽEN: sg
        V ČT: 2(process), 3(noniter), 4(simult), 5(immed), 6(0), 7(0)
            N STOL: sg; 7(where4)
    AD VELKÝ: 1(poz)

On the output of the 1st automaton we get the following representation:

# ( ( 8 AD 0 VELKE/ 1, 0.0,0,0, 0, 1. 0: I, 99.99.99,99.99,99.4 ( ( ( S N STOL (1.0.2.0. 0,0.0, I : 99, 99.99.99,99.99. 24, 2 8 D I : 99,99,99. 99399. 99, 99, 0 8 V @ C * T I. IO. I, 0. 15, 15.0,7,0.0.0,0.0,0.0.0.0.0.0,0,0,0.0, I, 3. I. 0, I, I. 3; 99-0. 1.0.0,99.0.2. 14.Dl: S(0)-99 8 N Z * EN 0.0, 0,0,0,0,0 , 2: 99. 99. 99. 99, 99. 99, 0, 14.V:: 7(0). 8(2) 8 CO *:BY/T 0; 99, 0, 1, 0, I. 99, 99, 99, 14.AD: 2(O), 3(O). 5(O), 6(l); 8(4) #

Note. 99 is the symbol for an empty position; O marks the trace of a word-form which has been connected with the governing word by a functor with a prime (i.e. it stands to the right of the governing word); a semicolon stands between the indices of the word-form and its grammatemes.

On the output of the 2nd automaton we get the following representation:

(i) # 8 N Z * EN 0.0.0.0.0.0, 0, 2: 99,99,99.99.99.99.0. 14.v:: 7(0). x(2) 8 v 0 C 2::T I. lo. I. 0. 15. 15. 0. 7. 0, 0. 0. 0, 0. 0, 0, 0. 0. 0, 0.0, (1.0.0, I. 3, I. 0. I. I. 3: 99,B. I. 0.99.0. 2.0, 14.Dl: S(O)-99 8 Dl KTEREi 0,M. 0.0.0,0,0.2; 99.99.99.99.99.99.0. 8 N STOL 0.0. 2.0. 0T0.0. 1: 99.99.99.99.99.99. 24. 2 ) ) ) 8 AD 0 VELKE/ I, 0,0.0.0.0. I, 0: I. YY.99. YY,99. 99. 99, 6 ) ) 8 CO * BY/T 0: 99. 0. I. 0. 1. 99. 99. 99. 14.AD: 3(0). J(0). 5(0). 6(I): X(4) #

(ii) # 8 N Z * EN 0. 0. 0.0.0,0.0, 2: 99, 99.99.99. 99.99,0. 14.V:: 7(O). 8(2) 8 ADVA 0 C 2::T I. 10, I, 0. IS, IS, 0, 7, 0.0.0,0,0.0,0,0.0.0,0.0,0.0.0, I, 3. 1.0. I. I. 3: 99.0, I. 0, Y9.0. 7. 14.Dl: 8(O)-99 8 ‘1 8 N STOL 0. 0, 2, 0. 0. 0. 0. I : 99. 99. 99. 99. 99, 99. 24. 2 ) ) ) 8 AD I$; VELKE/ 1,0. 0, 0, 0. 0. I. 0: I, YY. 99. 99. 99, 99. 99. 6 ) ) S CO 1: BY/T 11:99, 0. I. 0. I. YY. 99. 99, 14.AD: 2(0). 3(0). S(0). 6( I ): X(4) #

The representation (i) is a syntactic representation of the Czech sentence Žena, která u stolu čte, je velká (The woman that reads at the table is tall); the representation (ii) is a syntactic representation of a synonymous variant Žena u stolu čtoucí je velká (The woman reading at the table is tall).

[Flowchart: Control algorithm for automaton II. Legible labels: “is V/CO the root of the tree?”, “Table II B”, “next word”.]

[Flowchart: Control algorithm for automaton III. Legible labels: “governing element”, “governed element”, “according to the governed element”, “according to its governing element (Table III) (including the root)”, “N, AD, ADVA, ADVP?”.]

The output of the 3rd and the 4th automaton is the following morphemic representation:

(i) # 8 N Z * EN; 99, 99. 99, 99, 99,99,99, I, 99.99. 99.0, 2. S Dl KTERE/; 99,99,99,99, 99.99, 1,99.99,99,0, 2 S PREP U S N STOL 99,99,99,99,99,99,24. (56,2), 99,99.99.0, V C * TE 99,0,99.0.99,0,99,2,0.99,2,0,99 8 CO * BY/T 99.0,99,0.99,0.99,99.0,99.2. 0. 2 S AD VELKE/ I, 99, 99, 99, 99. 99, 99, 1, 99. 99, 99, 0. 2 #

99. I8

(ii) # S Z * EN; 99, 99, 99, 99,99, 99. 1.99. 99, 99.0, 3 S PREP U S N STOL 99,99,99,99,99, 99,24. (56.2) 99,99.99,0, 1 S ADVA C * TOUCI/ 99,99,99,99,99,99,99, I, 99,99,99.0,2 S CO * BY/T 99,0.99,0,99,0.99,99,0,99,2,0,2 8 AD VELKE/ I, 99,99,99.99,99,99, I, 99, 99. 99, 0, 2 #

The order of the word-forms in this representation corresponds to the topic-comment articulation; auxiliary words (prepositions) are treated here as individual elements; the word-forms have lost their indices (some of them were changed into grammatemes, e.g. number), and the complex auxiliary 14th grammateme has also disappeared.

REFERENCES
[1] P. SGALL, L. NEBESKÝ, A. GORALČÍKOVÁ and E. HAJIČOVÁ, A Functional Approach to Syntax. New York (1969).
[2] J. VACHEK, The Linguistic School of Prague. Indiana University Press, Bloomington, Indiana (1966).
[3] J. KURYŁOWICZ, Le problème du classement des cas. Biuletyn Polskiego Towarzystwa Językoznawczego 9, 20-43 (1949); see also his Esquisses Linguistiques I. München (1973).