Inform. Stor. Retr.
Vol. 1, pp. 19-28.
Pergamon Press 1963.
Printed in Great Britain.
AN EXPERIMENT IN MECHANICAL TRANSLATION
K. H. V. BOOTH
INTRODUCTION
The idea of using an electronic computing machine to translate from one language to another has received considerable attention during the last eight years or so, the main centres of interest being in Russia, America and England. Much has been written on the linguistic aspects of the problem, but there has been comparatively little published on actual experiments using a machine. One reason for this is that, until comparatively recently, no electronic computer possessed a store adequate to contain anything like a complete dictionary for translation. Even now, although larger stores are available, the time required for access to any particular entry still makes mechanical translation not too impressive from the point of view of speed, and moreover machines with such storage are not always available to those interested in the project of translation. At Birkbeck College, London, where one of the earliest groups was formed for work on M.T., the view was adopted that actual experiment, even with a severely restricted dictionary, could give useful information on the soundness of any scheme proposed, and the present paper describes a programme and dictionary used to try out translation from French into English. The experiment was conducted early in 1958 and this report was submitted in 1959.
THE LINGUISTICS
The patterns of sentence structure in French and English are much more nearly identical than in, say, English and German, and this was one reason for the choice of this pair of languages for experiment, since the storage available, and therefore the possible programme complexity, was very limited. An analysis of the mechanism of French to English translation showed that the following are the basic rules which must be observed.
1. When a noun is followed by an adjective in French, the adjective must be printed first in English.
2. An oblique pronoun followed by a verb must be printed in the reverse order in English, e.g. L’homme nous donne le livre - The man gives us the book.
3. A nominative pronoun (p1) followed by a second pronoun (p2) and a verb (v) must be printed p1 v p2, e.g. je le donne - I give it.
4. Two oblique pronouns followed by a verb (p1 p2 v) are printed as v p1 p2, e.g. L’homme le lui donne - the man gives it to him.
5. The configuration p1 p2 v, where p1 is nous or vous, is printed as p1 v p2 when the verb agrees with p1, and v p1 p2 when it does not, e.g. nous le donnons - we give it; l’homme nous le donne - the man gives us it.
6. Three pronouns followed by a verb (p1 p2 p3 v) are printed as p1 v p2 p3, e.g. nous le leur donnons - we give it them.
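The reordering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MAC programme: the tags `N`, `A`, `V`, `PN` (nominative pronoun), `PO` (oblique pronoun) and `X` (other), and the function name `reorder`, are hypothetical, and each token is assumed to carry its English gloss already. Rule 5 is left to whoever assigns the tags: nous or vous is tagged `PN` when the verb agrees with it and `PO` otherwise.

```python
def reorder(tokens):
    """Reorder (gloss, tag) pairs from French order into English order.

    Tags are illustrative: N noun, A adjective, V verb,
    PN nominative pronoun, PO oblique pronoun, X anything else.
    """
    out = []
    i, n = 0, len(tokens)
    while i < n:
        word, tag = tokens[i]
        # Rule 1: a noun followed by an adjective is printed adjective first.
        if tag == "N" and i + 1 < n and tokens[i + 1][1] == "A":
            out += [tokens[i + 1][0], word]
            i += 2
            continue
        # Rules 2-4 and 6: a run of pronouns followed by a verb.
        if tag in ("PN", "PO"):
            j = i
            while j < n and tokens[j][1] in ("PN", "PO"):
                j += 1
            if j < n and tokens[j][1] == "V":
                pronouns = [w for w, _ in tokens[i:j]]
                verb = tokens[j][0]
                if tag == "PN":
                    # Nominative first: p1 v p2 (p3).
                    out += [pronouns[0], verb] + pronouns[1:]
                else:
                    # Oblique pronouns only: v p1 (p2).
                    out += [verb] + pronouns
                i = j + 1
                continue
        out.append(word)
        i += 1
    return out


# je le donne
print(" ".join(reorder([("I", "PN"), ("it", "PO"), ("give", "V")])))
```

The look-ahead over a run of pronouns is the reason the syntactical analysis cannot begin until a whole sentence has been read.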
In addition the following refinements were introduced to make the text more readable.
7. Un or une followed by a noun or adjective is translated as ‘a’ or ‘an’ according as the following word begins with a consonant or a vowel.
8. Un or une not followed by an adjective or noun is translated as ‘one’.
9. Le, la or les followed by a noun or adjective is translated as ‘the’, otherwise as him/it, her/it and them respectively.
In order to minimize the number of dictionary entries required, the scheme of storing ‘stems’ and ‘endings’ in separate sections was adopted. Briefly, this consists of dividing words which may take different forms into a stem, which is the longest segment common to all forms, and a series of endings, most of which will, of course, be attachable to many different stems. Thus the verb parler will have as stem parl and as endings in the present tense, for example, -e, -es, -ons, -ez, -ent. Irregular verbs may require several stems, and sometimes it may be necessary to store individual parts since no stem can conveniently be isolated. The present tense of être is such an example. On the whole, however, the use of the stem-ending method cuts down the size of the dictionary considerably. It may be remarked at this point that the translation of the ending may be a prefix, a suffix, both or neither, and these cases must be distinguished by suitable code numbers in the dictionary.
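The stem, defined above as the longest segment common to all forms, can be illustrated with a short Python sketch. The helper name `split_forms` is hypothetical, and the stems in the experiment were chosen by hand rather than computed, so this only shows the definition at work.

```python
import os


def split_forms(forms):
    """Return (stem, endings) for a list of inflected forms.

    The stem is the longest common leading segment; each ending is
    whatever remains of a form once the stem is removed.
    """
    stem = os.path.commonprefix(forms)
    endings = ["-" + form[len(stem):] for form in forms]
    return stem, endings


stem, endings = split_forms(["parle", "parles", "parlons", "parlez", "parlent"])
print(stem, endings)
```

Applied to cheval and chevaux the same function gives the stem cheva with endings -l and -ux, which is how that noun is entered in the dictionaries listed later.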
THE COMPUTER: INPUT AND OUTPUT OF THE TEXT
The computer used in this experiment was the MAC at Birkbeck College, London. This machine is almost identical with the APEXC [1], and has a simple order code of 15 instructions and a store for 1024 words of 32 bits each. Only 960 of these were, however, actually available for the programme and dictionary. They are arranged in 30 ‘tracks’ each containing 32 words. Addition and subtraction take 0.6 msec and the average access to any word in the store is 10 msec. Input and output are via 5-hole punched paper tape, and the associated teleprinter equipment uses the simple code A = 1, B = 2, etc. The only punctuation available is ‘.’ and ‘,’, and accents are represented by arbitrarily assigned letters. Thus acute, grave and circumflex accents are represented by 01, 02 and 03 respectively. These are typed after the letter concerned. Thus avec will appear in the code as 1, 22, 5, 3 and été as 5, 01, 20, 5, 01.
The text to be translated is typed on a teleprinter punch in the usual way, with one space between each word, and a full stop at the end of each sentence. Punctuation, other than commas, is ignored or replaced by a full stop. The end of the text is indicated by typing ‘end’. For the purposes of the experiment sentence-length was restricted to 16 words, and word-length to 12 letters. There is no limit on the text-length. Thus the text is presented to the computer as a series of numbers divided into words and sentences by recognizable symbols, and the problem of ‘looking up’ a word in the dictionary becomes one of matching numbers until identity is established, a process which a computer can do very easily.
Output of the translated text is in the form of punched tape, the same code being used for the representation of letters. A printed version is obtained via a teleprinter.
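The letter code can be sketched as follows. The mapping A = 1, B = 2, etc. is from the text; the packing of six 5-bit letter codes, left-justified, into one number is an assumption made for illustration (it fits the 32-bit word and the six-letter limit mentioned later, but the exact layout in MAC is not given). Accent codes are omitted, and the function names are hypothetical.

```python
def letter_codes(word):
    """Code each letter as A = 1, B = 2, ... (unaccented letters only)."""
    return [ord(c) - ord("a") + 1 for c in word.lower()]


def pack(word, letters_per_word=6, bits_per_letter=5):
    """Pack up to six letter codes into one integer, left-justified.

    Assumed layout: 5 bits per letter, unused positions filled with 0,
    so 30 of the 32 bits of a computer word are occupied.
    """
    codes = letter_codes(word)
    if len(codes) > letters_per_word:
        raise ValueError("word too long for one computer word")
    codes += [0] * (letters_per_word - len(codes))
    value = 0
    for code in codes:
        value = (value << bits_per_letter) | code
    return value


print(letter_codes("avec"))           # the codes quoted in the text
print(pack("chien") < pack("eau"))    # numeric order follows alphabetic order
```

With this layout, sorting the packed numbers gives the same sequence as sorting the words alphabetically, which is the property the dictionary relies on.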
THE DICTIONARY (1)
The dictionary is the heart of any scheme for mechanical translation, and in order to understand the details of this experiment it is necessary to describe the exact form in which it is stored in the computer. Each word in the dictionary is represented by the number obtained by coding the letters as described in the last paragraph (A = 1, B = 2, etc.). Because the machine uses a numerical process in ‘looking up’ entries, the French words are arranged in ascending order of numerical magnitude; fortunately, from the point of view of those constructing the dictionary, this coincides with alphabetic order because of the numerical code used for letters. In the simplest, or word-for-word, form of M.T., the machine having located the required entry in the French side of the dictionary would simply extract the English equivalent by taking the word from the ‘same line’, or more precisely, from the corresponding location in the corresponding track of the English dictionary. Thus in the current experiment, the French dictionary occupied the whole of tracks 18, 19 and 20 and part of track 21. The English translations might then have been stored in the corresponding location of tracks 22, 23, 24 and 25. For the more sophisticated scheme of the present experiment, it is, however, necessary to have some grammatical information about the words before translation can be effected. This is provided by interposing a set of ‘keys’ between the French and English words, these keys being stored in the corresponding locations to the French words as described above. In the stem dictionary the keys contain the following information.
1. Part of speech (this is defined by a ‘structure number’).
2. Location in which English translation can be found.
3. Number of computer ‘words’ in translation.
4. Person (in case of pronouns).
5. Whether initial letter of translation is consonant or vowel.
Eleven types are distinguished under heading 1, and a 4-bit code number serves to distinguish them. The types are nouns, adjectives, verbs, pronouns (1), pronouns (2), pronouns (3), pronouns (4), un or une, full stop, end of message (end), and words with no syntactical significance, i.e. those not requiring rearrangement. The latter class have structure number zero. The classification of the pronouns into groups is as follows.
Group 1: je, tu, il, elle, ils, elles.
Group 2: nous, vous.
Group 3: me, mon, ma, mes, ta, ton, tes, son, ses, sa, le, la, les, leur, lui.
Group 4: le, la, les (in typing the message l’ must be typed as le).
The location of the English translation requires 10 bits for definition. Where more than one computer word is required for the translation, consecutive words are used. The length of the English translation is limited to 18 letters for convenience, i.e. 3 computer words, thus heading 3 requires 2 bits for identification. Heading 4 requires 3 bits to distinguish the 6 possible persons and heading 5, 1 bit, this being 0 for a consonant and 1 for a vowel. Thus 20 bits are used of the 32 available in the computer word. It is hoped that when more storage becomes available an extended programme, taking into account idioms, may be tried. In this case the remaining 12 bits will be used for idiom classification in the manner described in [2].
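The 20-bit stem key can be illustrated by packing the five fields into one integer. The field widths (4, 10, 2, 3 and 1 bits) are those stated above; the order of the fields within the word, the field names and the sample values are assumptions made only for this sketch.

```python
# (name, width in bits); widths from the text, ordering assumed.
FIELDS = [
    ("structure", 4),   # part of speech (structure number)
    ("location", 10),   # location of the English translation
    ("length", 2),      # number of computer words in the translation
    ("person", 3),      # person, for pronouns
    ("vowel", 1),       # 0 = translation starts with consonant, 1 = vowel
]


def pack_key(**values):
    """Pack the five fields into a single 20-bit key."""
    key = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        key = (key << width) | value
    return key


def unpack_key(key):
    """Recover the fields from a packed key."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = key & ((1 << width) - 1)
        key >>= width
    return fields


# Hypothetical entry: structure number 1, translation at location 704.
key = pack_key(structure=1, location=704, length=1, person=0, vowel=0)
print(unpack_key(key))
```

The widths sum to 20, leaving 12 bits of the 32-bit word free, in agreement with the figure given for idiom classification.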
The endings dictionary is stored in a similar manner, and here the keys contain the following information.
1. Location of translation (10 bits).
2. Type of ending (2 bits).
3. Person (3 bits).
The type of ending may be either a prefix (e.g. the ending -er of the infinitive translates as ‘to’) or a suffix or both. If there is no English translation of the ending the location of translation given contains a number which gives no output in printing. A complication must be mentioned here, namely that the same ending may translate in different ways according to the type of stem. Thus, the stem fini will have the translation ‘finish’, and the ending -s attached to this stem requires no translation as in je finis - I finish. Attached to a noun, however, it requires the translation -s as in plume/s - pen/s. This is overcome by examining the type of stem and extracting the ending translation from different locations accordingly. An imperfection still exists in this scheme since the same stem and ending may require a different translation according to the preceding pronoun. Thus the ending -e in je donne and il donne translates as nothing in the first case (I give) and ‘s’ in the second (he gives). This could be dealt with by noting such ambiguous endings and making the programme examine the preceding pronoun where necessary. There was, however, insufficient room to include this in the present experiment.
THE PROGRAMME
The general scheme of operation is as follows. The message, punched in coded form on a tape as described, is placed in the input reader of the computer. The first word is read in, looked up in the dictionary, and the relevant keys for the stem and ending extracted and stored. A test is then made to see whether the stem key is that for ‘end of message’. If so the machine stops, but if not a test is made for ‘full stop’. If not a full stop, a counting index, which records the number of words in the sentence, is increased by unity and the next word is read in. When ‘full stop’ is encountered, the programme then proceeds to apply the syntactical rules given in the preceding paragraph to the sentence. The correct rearrangement of nouns and adjectives is disentangled first, and the pronoun-verb rearrangements made where necessary. As soon as it is decided that a particular word has been satisfactorily resolved, its translation is punched out. This involves a separate subroutine which first extracts the ending, where one exists, and disposes it suitably with respect to the stem according as it is a suffix, prefix or both. The stem is then extracted, stored and finally the complete word is punched.
It is not proposed to give a detailed list of the instructions involved in this programme, since this would be very lengthy and perhaps not of general interest. A detailed programme for a similar but simpler M.T. experiment is given in [3]. The flow diagram in Fig. 1 gives the best overall picture of the programme without resorting to tedious detail. Some explanation is needed of the details here: i1 and i2 (Box 1) are count numbers (or iteration indices) used during the syntactical analysis to keep track of the current position in the sentence. Fi is a code number used to set an instruction address in the contingency that the French word cannot be located in the dictionary. When this occurs the word is punched out in its original form, and in order to do this, the programme manufactures a ‘key’ which gives the location where the word is stored until ready to be punched, and which states that the word has structure number zero. At the start of the programme all these quantities, and a number of instructions which are modified during the programme, are reset.
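The step in which the punching subroutine disposes the ending translation about the stem translation can be sketched as follows. The table `ENDING_TABLE` and the function `punch` are hypothetical names; the few entries shown are taken from examples in the text and from the endings dictionary, and keying the table on the stem type is one simple way of modelling the rule that the same ending is translated from different locations according to the type of stem.

```python
# (stem type, French ending) -> (English prefix, English suffix).
# Only a handful of illustrative entries; "" means no translation.
ENDING_TABLE = {
    ("verb", "er"): ("to", ""),     # infinitive: prefix 'to'
    ("verb", "rai"): ("will", ""),  # future: prefix 'will'
    ("verb", "t"): ("", "es"),      # finit -> finishes
    ("verb", "s"): ("", ""),        # je finis -> I finish
    ("noun", "s"): ("", "s"),       # plumes -> pens
}


def punch(stem_translation, stem_type, ending):
    """Assemble the English word from stem and ending translations."""
    prefix, suffix = ENDING_TABLE.get((stem_type, ending), ("", ""))
    word = stem_translation + suffix
    return f"{prefix} {word}" if prefix else word


print(punch("finish", "verb", "s"))
print(punch("pen", "noun", "s"))
print(punch("talk", "verb", "er"))
```

An ending that is both prefix and suffix would simply carry two non-empty strings in its entry.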
[FIG. 1. Flow diagram of the programme; surviving box labels: start, punching subroutine.]
Box 2 represents the looking-up process in the dictionary. This is done by the bracketing method described in [4], the stem dictionary being consulted first. If an exact stem is found, the key is stored in one of a series of consecutive locations called k_s, and zero is placed in the corresponding place in a second series of ending locations k_e. If no exact stem can be located, that giving the smallest positive remainder when subtracted from the French word is chosen, and the remainder looked up in the endings dictionary. If an exact ending is found, the ending key is stored in k_e, but if not, the word is assumed not to be in the dictionary and a stem key is manufactured and stored in k_s, k_e being zero. k_s is tested to see whether it represents ‘end of message’ or ‘full stop’ and the appropriate course of action followed as indicated; when a ‘full stop’ is encountered, the programme proceeds to a syntactical analysis of the sentence using the stem and ending keys which are now stored in consecutive locations. This part of the programme is represented by the boxes between α and γ, and it is hoped that the general idea can be seen from the flow diagram. The iteration indices i1 and i2 will be zero at the start of the syntactical analysis and the effect of box 3 initially is therefore to examine the stem key of the first word in the sentence. Normally i1 and i2 will be equal to the number of words of the sentence which have already been punched out, but in box 4, for example, one of them is increased by one before the corresponding word is punched, since it is necessary to see whether the next word is an adjective before deciding which to punch first. Similarly, in the pronoun-verb section between β and γ, it is sometimes necessary to go four words ahead before deciding on the order of punching.
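The look-up of Box 2 can be sketched with an ordinary binary search standing in for the bracketing method of [4]. This is a simplified model under stated assumptions: words are compared as strings rather than as coded numbers, so 'smallest positive remainder' becomes 'nearest stem not greater than the word', and the remainder is the text left after the stem. The function name `look_up` and the small sample dictionaries are hypothetical.

```python
import bisect


def look_up(word, stems, endings):
    """Return (stem, ending) for a French word, or (None, None).

    stems   -- list of stems in ascending (alphabetic) order
    endings -- set of endings, written without the leading hyphen
    """
    # Nearest stem that does not exceed the word.
    i = bisect.bisect_right(stems, word)
    if i == 0:
        return None, None
    stem = stems[i - 1]
    if stem == word:
        return stem, ""              # exact stem: ending key is zero
    remainder = word[len(stem):]
    if word.startswith(stem) and remainder in endings:
        return stem, remainder       # stem plus a known ending
    return None, None                # word assumed not in dictionary


stems = sorted(["aim", "cheva", "chien", "donn", "femme", "parl", "tomb"])
endings = {"e", "ent", "ai", "l", "ux", "s"}
print(look_up("donnai", stems, endings))
print(look_up("chevaux", stems, endings))
```

When `look_up` returns `(None, None)` the programme described above would manufacture a key with structure number zero and punch the French word unchanged.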
The punching subroutine extracts the translations of the stem and ending and arranges them, together with spaces where needed, in a series of locations PN, the contents of which are then punched. The programme takes 570 locations for instructions and storage, and running time is roughly 2 sec per word including input and output.
THE DICTIONARY (2)
It will be seen that since only 390 computer words were available for a dictionary (960 less the 570 taken by the programme), nothing very comprehensive could be attempted. In fact each dictionary entry requires at least 3 computer words, so that only about 130 entries could be made. A limitation of 6 letters (one computer word) was set on the stem and ending length, since otherwise two or more words would have to be allocated to every entry irrespective of length. No restriction was placed on the length of the English equivalent, however. Further to cut down on space, only the 1st and 3rd persons singular and the 3rd person plural were entered in the endings for verbs, and tenses were restricted to the present, future and past. It will be seen that sometimes different endings have the same translation. In these cases it is, of course, necessary to store the translation only once. Some entries in the stem dictionary have several translations according to the context, for example la. All possible meanings are output in these cases. It would be possible, by further syntactical analysis, to avoid this, but space did not permit the necessary extension to the programme. Similarly, the ending -e when applied to a verb may either require no translation, or translation as the suffix ‘s’. Thus je parle - I speak, il parle - he speaks. In this case both possibilities are indicated in printing.
Stem dictionary
The following words or symbols comprised the dictionary of stems.

NOUNS
French    English           French    English
cheva     horse             jardin    garden
chien     dog               lettre    letter
eau       water             livre     book
end       not translated    maison    house
femme     woman             nom       name
femmes    women             oiseau    bird
garcon    boy               poste     post
homme     man               rue       street
hommes    men               table     table

ADJECTIVES
French    English           French    English
grand     big               neuf      new
joli      pretty            noir      black
leurs     their             pauvre    poor
lourd     heavy             petit     little
malade    ill               triste    sad
neu       new               vert      green

VERBS
(avoir)
French    English
ai        have
a         has
ont       have
eus       had
eut       had
eurent    had
eu        had
au        have

(être)
French    English
suis      am
est       is
sont      are
fus       was
fut       was
furent    were
se        be

(others)
French    English           French    English
aim       like              jou       play
demeur    live              parl      talk
donn      give              perd      lose
entend    hear              saisi     seize
entr      enter             sauv      save
ferm      shut              suiv      follow
fini      finish            tomb      fall

PRONOUNS
Group     French    English
(1)       elle      she/it
          elles     they
          il        he/it
          ils       they
          je        I
(3)       leur      their/to them
          lui       to him
          ma        my
          mon       my
          mes       my
          me        me/to me
          sa        his/her
          ses       her/his
          son       his
(4)       la        the or her/it
          le        the or him/it
          les       the or them

OTHER WORDS (Structure number zero)
French    English           French    English
à         to/at             mais      but
avec      with              où        where
dans      in                que       whom/which
de        of/from           qui       who/which
des       of the            sans      without
devant    before            si        if
du        of the            sur       on
encore    again             très      very
et        and               y         there
été       been              un        a or an
hier      yesterday         une       a or an

TELEPRINTER SYMBOLS
Symbol             Output
.                  .
,                  ,
Carriage return    Carriage return
Line space         Line space
Endings Dictionary

NOUNS
Ending    Translation
-s        -s
-l        -
-ux       -s
-x        -s

ADJECTIVES
Ending    Translation
-e        -
-es       -
-s        -
-ve       -
-ves      -

VERBS (p = prefix, s = suffix)
Ending    Translation    Type    Person
-rai      will           (p)     1s
-ra       will           (p)     3s
-ront     will           (p)     3p
-e        -/s            (s)     1s or 3s
-ent      -                      3p
-ai       did            (p)     1s
-a        did            (p)     3s
-èrent    did            (p)     3p
-erai     will           (p)     1s
-era      will           (p)     3s
-eront    will           (p)     3p
-é        -                      -
-s        -                      1s
-t        es             (s)     3s
-ssent    -                      3p
-rent     did            (p)     3p
-is       did            (p)     1s
-it       did            (p)     3s
-irent    did            (p)     3p
-u        -                      -
-és       -                      -
-ée       -                      -
-ées      -                      -
-er       to             (p)     -
-ir       to             (p)     -
-re       to             (p)     -

SOME EXAMPLES OF TRANSLATION
With such a limited dictionary it was, of course, necessary to construct rather artificial sentences as tests. Here are some, together with their translations.
1. L’homme qui m’a parlé hier est très malade; il est tombé de son cheval devant sa maison.
The man who/which did talk me/to-me yesterday is very ill. He/it did fall of/from his horse before his/her house.
2. Ils me donnèrent un petit chien noir hier mais je l’ai perdu dans la rue.
They did give me/to-me a little black dog yesterday, but I did lose him/it in the street.
3. Ils seront très triste si leurs chevaux tombent dans l’eau.
They will be very sad if their/to-them horses fall in the water.
4. Où sont les lettres que je leur donnai? Ils les portèrent à la poste.
Where are the letters whom/which I did give their/to-them? They did carry them to/at the post.
5. Où sont mes lettres? Je les donnai à la femme avec le petit garçon.
Where are my letters? I did give them to/at the woman with the little boy.
The object of this experiment was to devise a system for translating scientific material and although the examples given above cannot lay claim to literary elegance, there is no ambiguity in meaning; this is the important point in scientific translations. There are several refinements which it is hoped to add to the programme when space is available. One of these concerns the translation of past participles. Thus the past participle of such verbs as jouer and entrer requires the addition of ‘ed’ to the stem translation (play-ed, enter-ed); entendre and saisir require the translation ‘d’ (hear-d, seize-d) and tomber ‘en’ (fall-en). It is therefore necessary to subdivide verbs according to the form of their past participle in English. This can be done simply by increasing the number of structure types defined in the key word.
The question of idioms was mentioned earlier. The inclusion of an idiom-detecting routine would certainly be essential for dealing with a literary text, and probably fairly desirable even with scientific texts, since the French language is highly idiomatic. Such a routine would be interposed at the point α of the flow diagram. It would probably lead to a considerable increase in the size of the dictionary required, although the effect is hard to estimate, since no-one has yet attempted a compilation of idioms from the point of view of M.T.
Acknowledgement - The author wishes to thank Mrs. M. Gould for considerable assistance in the preparation of the dictionary used in this experiment.
REFERENCES
[1] A. D. BOOTH and K. H. V. BOOTH: Automatic Digital Calculators, Academic Press, New York (1956).
[2] A. D. BOOTH, J. P. CLEAVE and L. BRANDWOOD: Mechanical Resolution of Linguistic Problems, Academic Press, New York (1958).
[3] K. H. V. BOOTH: Programming for an Automatic Digital Calculator, Academic Press, New York (1958).
[4] A. D. BOOTH: On the use of a Computing Machine as a Mechanical Dictionary, Nature, Lond. 1955, 176, 565.