A MATHEMATICAL FORMULATION OF KEYWORD COMPRESSION FOR THESAURI

LOUIS-GAVET GUY
Chargé d'enseignement, Université de Lyon I, Mas des Berlandières, Rte de Beaurepaire, Balbins, 38-260 La Côte-Saint-André, France

Abstract: In this paper we demonstrate a new method for concentrating the set of key-words of a thesaurus. This method is based on a mathematical study that we have carried out into the distribution of characters in a defined natural language. We have built a function f of concentration which generates only a few synonyms. In applying this function to the set of key-words of a thesaurus, we obtain a very high rate of compression.
1. INTRODUCTION: CONTEXT OF THE PROBLEM
In a documentary system, the main problem is less to bring an innovating solution into the structure, or to link together the different files, than to compress data. The three key-words of a documentary system are "Information, Storage, Retrieval". To reach the information with the greatest pertinence in a given field, we have to use a language which is in fact a metalanguage: it allows the various equations called "documentary" to be solved. Such a language is represented in the documentary system by the thesaurus, the set of key-words (or concepts) being linked together by all the semantic and syntactic relations which are possible. Very simply, we shall say that to answer a specific question, we have to look up in the thesaurus the concepts raised by the question.

The situation is most difficult at the storage level of the thesaurus. Keywords have different lengths (sometimes as many as 70 characters in a chemical thesaurus), resulting in:
- a rather substantial volume;
- an important execution time while searching for a keyword.
In order to have a really efficient documentary system, it would be advisable to put the most important part of the thesaurus in memory.

It is interesting to note that computer scientists have always tried to compress the data held in files. This is all the truer since nowadays we witness a growing expansion of their volume, which consequently renders storage and search costs exorbitant. The different research devoted to this delicate problem has aimed at drawing an advantage from:
- the data, with unique characteristics that might be compressed (spaces, logical links between one another, probabilities of occurrence, ...);
- codes whose representation is greater than the number of characters which are ordinarily used (ASCII: 128 possible symbols, EBCDIC: 256 possible symbols).
In the first case we note the experiments carried out by OLIVER[24], HONIEN LU[16], FAJMAN[9], SALTON[27], MARRON[6], CULLUM[5], HAHN[14] and the
HASP System[23]; in the second case, those carried out by BEMER[1], DONIO[8], CORGIER[4] and DEWEZE[7]. GOTTLIEB[11] and FRAZER[10] give a good classification of these methods. The operation of these methods is more or less complex, and this often leads to a delicate application (for example, in the SOLID system[6], 7 steps are required to compress the data). It must be noted that those which are presently applied only reach rates of compression inferior to 0.5 (4 or 5 bits per character) for texts written in natural language; the best-performing seems to be that of Hahn: 3-4 bits/char[14].

More particularly, to compress the keywords of a thesaurus, note the following experiments. For his thesaurus "Économie et gestion des entreprises", Deweze[7] has tried to produce unique arrangements of 5 characters in the keywords, by studying those which were the most significant for definite keywords. At this level it is less a study of the structure of the keywords than an abbreviation, as is proposed by SAVAGE[28] or BOURNE and FORD[3]. However we reach a
very high rate of synonymity with such methods, since whole parts of the keywords themselves are lost by the technique of abbreviation (e.g. "oxyde" and "ether" in a chemical thesaurus). The same limitation is found in the ABACUS system[29], where the code permitted for a keyword is linked to the natural language corpus on which it is applied.

We have therefore put forward an original method. For one part, to reduce the volume of a thesaurus file in memory, we have studied the distribution of monograms, bigrams or trigrams in natural languages. This mathematical structure (Section 2), applied on the set of keywords of a thesaurus, has permitted us to find a function f of compression (Section 3). In Section 4 we shall show that this function allows us to store the whole thesaurus in memory, at a compression rate of around 1/4 to 1/30 of the original data, in a table called "TABLE-THESAURUS", on which the retrieval of a keyword is done in a few tests.

2. MATHEMATICAL APPROACH TO THE PROBLEM

2.1 Definitions
Consider:
- T = {t1, t2, ..., tn}, a finite set of elements representing the input data of a file (the keywords);
- E = {e1, e2, ..., en}, a finite set of elements representing the output data or "prints" of the file;
- f, a function of compression defined as follows: f: T → E such that ti ≠ tj ⟹ f(ti) ≠ f(tj);
- the elements t of the set T being strings of characters t = {σ1 σ2 ... σi ... σp} of variable length L, with σi ∈ {letters, figures, spaces, special characters};
- the elements e of the set E being strings of characters e = {v1 v2 ... vi ... vq} of fixed length q, or "prints", with vi ∈ {letters, figures, spaces, special characters} (q fixed).

This can be resumed in the following schema:

Input data (FILE: set T) → f → Output data (FILE: set E)
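The function f above can be illustrated with a minimal sketch. The sampling rule used here (taking q characters of the keyword at a regular step) is an assumed placeholder, not the paper's actual construction, which is developed in Section 3; the exception for short keywords follows the rule given in Section 3.1.

```python
# Sketch of a compression function f: T -> E. The character-sampling
# rule is a placeholder assumption; the paper's real algorithm builds
# the "print" from statistics on monograms (Section 3).

def f(t: str, q: int = 4) -> str:
    """Map a variable-length keyword t to a fixed-length 'print' of q characters."""
    if len(t) <= q:                       # exception: short keywords kept as-is
        return t.ljust(q, "-")
    step = len(t) / q                     # sample q characters at a regular step
    return "".join(t[int(i * step)] for i in range(q))

T = ["CONTROLEUR", "ENTREPRISE", "GESTION FINANCIERE"]
E = [f(t) for t in T]
# f is useful when it is (almost) injective: distinct keywords should
# give distinct prints, i.e. as few "synonyms" as possible.
synonyms = len(E) - len(set(E))
```

The interesting quantity is `synonyms`: the method of the paper is precisely a way of choosing f so that this count stays negligible.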
In order to standardize, a set of definitions has been constituted according to this table, in BACKUS notation[2]:

⟨monogram⟩ ::= ⟨character⟩
⟨character⟩ ::= ⟨letter⟩ | ⟨space⟩ | ⟨delimiter⟩ | ⟨rare character⟩
⟨letter⟩ ::= A | B | C | ... | Z
⟨delimiter⟩ ::= * (asterisk, end of an element t)
⟨rare character⟩ ::= ⟨figure⟩ | ⟨orthographic sign⟩ | ⟨punctuation mark⟩ | ⟨special character⟩
⟨figure⟩ ::= 0 | 1 | 2 | 3 | ... | 9
⟨orthographic sign⟩ ::= ' (apostrophe) | - (hyphen)
⟨punctuation mark⟩ ::= , | ; | : | ! | ? | .
⟨special character⟩ ::= " | + | %
⟨digram⟩ ::= ⟨monogram⟩⟨monogram⟩
⟨trigram⟩ ::= ⟨monogram⟩⟨monogram⟩⟨monogram⟩

This table will be completed by these definitions:

⟨identifier⟩ ::= element t of the set T
⟨cardinal⟩ ::= number of elements of a set
⟨redundancy⟩ ::= identity of monogram, digram, trigram

For our problem we will consider a statistical structure (Ω, a, p):
- Ω, the set of characters representing the set of identifiers;
- a, a set of parts of Ω;
- p, a set of laws of probability.

2.2 Theorem and corollary
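The BACKUS definitions above can be transcribed directly as character classes; where the printed table is only partly legible, the exact inventories of punctuation and special characters below are assumptions.

```python
# Character classes from the BACKUS-notation table above. The exact
# inventories of punctuation and special characters are assumptions,
# since the printed table is only partly legible.

LETTERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
FIGURES = set("0123456789")
ORTHOGRAPHIC = set("'-")          # apostrophe, hyphen
PUNCTUATION = set(",;:!?.")
SPECIAL = set('"+%')
DELIMITER = set("*")              # asterisk: end of an element t

def classify(monogram: str) -> str:
    if monogram in LETTERS:
        return "letter"
    if monogram == " ":
        return "space"
    if monogram in DELIMITER:
        return "delimiter"
    if monogram in FIGURES | ORTHOGRAPHIC | PUNCTUATION | SPECIAL:
        return "rare character"
    raise ValueError(f"not a monogram of the alphabet: {monogram!r}")

def digrams(s: str) -> list:
    # <digram> ::= <monogram><monogram>
    return [s[i:i + 2] for i in range(len(s) - 1)]
```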
Let t be an input string representing an identifier:

t = {σ1, σ2, ..., σx, ..., σp}

The succession of monograms σi reflects a structure defining the language in which t is written. The problem is to know whether the σi are distributed according to a law. Our first step will be to consider a monogram as an independent random variable in a statistical law. But this can only be a rough approximation (especially because of the double monograms); that is why we have been led to consider several approximations in succession, by taking the (x-uples) of monograms {σi}, {σi, σi+1}, ..., {σi, σi+1, ..., σx} as independent elements. We prove that an adequation test[15] with the binomial law is verified progressively better as x increases.
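The adequation (goodness-of-fit) test against the binomial law can be sketched with the usual normal approximation. The paper does not give its exact statistic, so the counts and probability below are illustrative assumptions only.

```python
import math

# Sketch of an adequation test of an x-uple's occurrence count against
# the binomial law, using the normal approximation. The paper's exact
# statistic and thresholds are not reproduced; the numbers are
# illustrative assumptions.

def binomial_z(observed: int, n: int, p: float) -> float:
    """z-score of an observed count among n draws with success probability p."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    return (observed - mean) / std

def accepts(observed: int, n: int, p: float, alpha_z: float = 1.96) -> bool:
    """Accept the binomial hypothesis at roughly the 5% two-sided level."""
    return abs(binomial_z(observed, n, p)) <= alpha_z

# e.g. a monogram assumed to occur with probability 0.15 in a corpus of
# 1000 monograms: an observed count of 160 is compatible with the law,
# while 300 is not.
```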
In this first part we shall introduce the following theorem. Let t be an input string representing an identifier:

t = {σ1, σ2, ..., σi, ..., σp}

(with σi ∈ {letters, figures, spaces, special characters}), arranged in an order which defines the identifier language. Let the (x-uples) be tuples of monograms. Then the adequation test with the binomial law applies with a threshold α of signification, and:

x ∈ {1, ..., n}, a parameter depending on the language, gives the independence of an (x-uple) with the following monogram.
Corollary. Let:
- Ω be the set of monograms representing the set of the identifiers;
- E be the set of (x-uples) satisfying the adequation tests with the binomial law: let N be its number;
- B be the complementary set: M its number.
Then

M / N → 0 as x → n

(with a threshold of signification α).
ADEQUATION TESTS

[Table: for each monogram A to Z, the threshold α at which the adequation test is accepted; the printed values are only partly legible.]

Adequation test in terms of a significative α, for digrams:

α = 0.9: AC, AY, BI, BL, BO, BR, CI, CL, CR, DO, DR, ED, EE, EG, EI, EO, E, FR, GA, GU, GI, ID, JA, LU, MM, MP, NU, OP, OR, OS, RD, RE, RL, RR
α = 0.8: AB, AG, AM, BA, CT, DO, DR, IC, IM, I, JE, MI, NA, NI, OL, OM, OU, PA, PE, PR, PS, RT, SA, SI, SO, SS, S
α = 0.7: AI, AL, CH, EL, EM, GE, HE, IO, IQ, IR, IT, LI, LL, LO, L, TS, TT, TU, UT, VO, H, I, D
α = 0.6: AT, AU, A, CH, EN, IF, IS, ME, NE, E, M, P, R, S
α = 0.5: MO, NA, NC, NI, NS, NT, OR, OU, RA, RO, R
α = 0.4: EC, EF, EZ, HU, IA, OS, PU, RU, UB, VO
α = 0.3: AR, AV, AP, BU, CU, DI, FO, OI, RI, VA
3. STRUCTURE OF THE "PRINT" OF A KEYWORD[19]
3.1 Criteria defining the construction of the "print"
For the method to be efficient, the "print" must have the following characteristics:
- its length q must be fixed and much inferior to the average length of an identifier.

Example: THE DOCUMENTARY INFORMATICS. A print on 4 characters will give: OTNT.

An exception to this rule exists when p ≤ q. In this case, the "print" will be the keyword itself.
3.2 Improvements of the algorithm
The statistics that we have developed a priori indicate the risk of synonymy of two "prints"[8], but do not indicate the possible defects of the structure of the monograms. Just as we have a print of fixed length, we might have particular cases at the level of the distribution of the same monograms (or digrams, trigrams) in the same position of the print (defects which would come from the structure of the keywords). In these conditions, wouldn't it be superfluous or insufficient to take a string of x monograms as a print? Thus, it is at the level of the algorithm which builds the print that we have had to avoid the defects of distributing monograms in the keyword.

This implies that we have had to develop different statistics at the level of the print, particularly in its entirety, because the natural language is in fact an estimation of an ergodic system, but we shall have on no account such an approximation for the set of "prints". (It is obvious that it must be the contrary: in fact, to have the maximum of combinations, the monograms must be distributed randomly in the set of prints.)

Let's take a simple example. We have noted that the vowels appear at a rate of 34% in second position in the prints. We managed to lower this "rise" by improving the algorithm thanks to the statistical theory developed on the set of prints. This proved sufficient at first, but very soon we saw that we were no longer controlling the other defects which were developing in other respects at the level of the print. So, as the print has a fixed length, we thought of a statistical theory which connects the position of the monogram to the monogram itself. We have thus created a mathematical tool which enables the distribution of monograms in each position of the print to be known for a set of keywords. This is summed up in the following schema:
[Schema: the statistics developed on the set of prints feed back into the construction algorithm, yielding a good algorithm.]
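The statistical tool connecting each position of the print to the monogram found there can be sketched as a simple positional count; the example prints below are assumptions.

```python
from collections import Counter

# Positional statistics on a set of prints of fixed length q: for each
# position, count the distribution of monograms found there. The
# example prints are assumptions, not data from the paper.

def positional_distribution(prints):
    q = len(prints[0])
    assert all(len(p) == q for p in prints)   # prints have fixed length
    return [Counter(p[i] for p in prints) for i in range(q)]

prints = ["OTNT", "ABJE", "ORNE", "ATNT"]
dist = positional_distribution(prints)
# dist[1] tells, for instance, how often a vowel occupies the second
# position: the kind of "rise" (34% of vowels in second position)
# mentioned in the text.
vowel_rate = sum(dist[1][v] for v in "AEIOUY") / len(prints)
```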
We have not presented a complete list of all the improvements that can be made on our algorithm. They are numerous, and we have certainly missed a few. So we shall only quote some of them, the most important, that we have experimented on in several different sets.
First improvement. This first improvement is immediate and consists in avoiding, in the algorithm, the particular zones of the elements e of the set E. Let's take two examples:
- 50% of titles of books begin with an article, thus there is a strong probability of the monograms "l", "e", "o", ... in this zone;
- in names and keywords, the monogram "space" is always in the same zone.
Second improvement.
By this algorithm, we cannot have in the set of the prints monograms which are not to be found in the set of identifiers. To avoid this inconvenience, which was diminishing the potential number of monograms, we have given each one a "weight".

(a) Notion of weight, first form. This notion implies that a monogram may only appear in the string of characters of a print a limited number of times. Let n be this number. We substitute rare or special monograms (Y, W, ?, !, ...) for the (n + 1, ..., n + m) redundancies of this monogram. Each monogram is linked to a specific list of monograms. Example for A:
? will be the (n + 1)th
! will be the (n + 2)th

With this method the following print: ABJAAA will become: ABJ?!-. The method is particularly interesting if the "print" is long.

Remark: to avoid too strong a redundancy in certain zones (for example the "-" in 6th position), rules will be used to decide where this method applies. For example, as 50% of titles of books begin with an article, the method will in this case apply to the beginning of the print. For example, the print AAREAA will become: -!RE?A.

(b) Notion of weight, second form. This notion is linked to the frequencies of a monogram in the set of the identifiers and to their positions in the prints (see the following table).
[Table: weights assigned to the monograms (space, A, E, I, S, R, B, F, M, L, T, C, Y, ... and the rare characters ?, $, %, ...) according to their frequencies and positions in the prints; the printed values are only partly legible.]
This method is applied on short prints. To sum up, these improvements have a triple advantage:
- to operate on the intervals of redundancy of monograms so that they are all equal;
- to introduce monograms which do not belong to the set of identifiers;
- to create a balance between all the monograms used in the set of identifiers.
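The first form of the weight notion above can be sketched directly. The substitution list for A reproduces the paper's own example (?, !, - with n = 1); lists for the other monograms would be filled in the same way.

```python
# First form of the "weight" notion: a monogram may appear at most n
# times in a print; further occurrences are replaced by rare monograms
# taken from a list attached to that monogram. The list for A follows
# the paper's example (ABJAAA -> ABJ?!-); other lists are omitted.

SUBSTITUTES = {"A": ["?", "!", "-"]}

def apply_weight(print_str: str, n: int = 1) -> str:
    seen = {}
    out = []
    for m in print_str:
        seen[m] = seen.get(m, 0) + 1
        if m in SUBSTITUTES and seen[m] > n:
            # the (n+1)th occurrence gets the first substitute, and so on
            out.append(SUBSTITUTES[m][seen[m] - n - 1])
        else:
            out.append(m)
    return "".join(out)
```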
3.3 Application
In applying our function f, with the following hypotheses and the second improvement, to the set of keywords:

Card(T) = 3000, C = 64, q = 4, e = {v1 v2 v3 v4}, with a step of about L/3.

Examples: CONTROLEUR * ENTREPRISE * GESTION FINANCIERE * INTEGRATION / PLANIFICATION * BENEFICE * CORRELATION * ACHAT * REDONDANCE * COMPARATEUR ENERGETIQUE * GAIN MENSUEL * TAUX HORAIRE * INVESTISSEMENT PLANIFIE * RETRAIT * TRAITEMENT * ABATTEMENT DE ZONE * INDEMNITE JOURNALIERE * PONDERATION DES PRIX * PLANIFICATION DES COUTS * RUPTURE DE STOCK * ECHEANCE * CIRCUIT FINANCIER * FINANCEMENT DES ENTREPRISES * RATIONALISATION DES CHOIX BUDGETAIRES * ANALYSE ECONOMIQUE * ECONOMIE DE L'INFORMATIQUE * AUTOFINANCEMENT * BANQUE DE DONNEES * COMPTABILITE ANALYTIQUE * COMPTE D'EXPLOITATION * CONTROLE DE GESTION * COUT * OPERATION FINANCIERE * PRIX DE REVIENT *
I T’7 \ PlZli I(& \ TRIM NC“ 11’ ‘r RLUk C VI I DWNR R Fl MZL ‘iHIU I \I Q\ I’RP( I NQ TH’Zr; UK\‘M II.SX FY P :m WFC’I HN 1 \‘I’;( I:/\ + SU(~,Z YCV’L PYCM PKFT VI EQ NRYB RPTC; RD\VT COL’T \,NR YIVI.
We have not had synonyms (C^q = 64⁴ ≈ 16.8 M possibilities of random "prints"). We have tried q = 3, e = {v1 v2 v3}, and we have obtained a rate of synonymity of 1/1000 (1 per thousand). This is negligible (see examples on the following page).
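The synonymity figures quoted above can be checked by a rough order-of-magnitude computation, under the idealizing assumption that prints behave like uniform random draws (a birthday-bound estimate, not the paper's own calculation):

```python
# Order-of-magnitude check of the synonymity figures quoted above.
# With an alphabet of C = 64 monograms and prints of q = 4 characters
# there are 64**4 possible prints; for n = 3000 keywords hashed
# uniformly at random (an idealizing assumption), the expected number
# of colliding pairs is roughly n*(n-1)/2 divided by the number of
# combinations (birthday bound).

C, q, n = 64, 4, 3000
combinations = C ** q
expected_collisions = n * (n - 1) / 2 / combinations
```

With these hypotheses the expected number of collisions stays below one, which is consistent with the absence of synonyms observed for q = 4.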
4. ELABORATION OF A "TABLE-THESAURUS"
With such a rate of compression, we demonstrate that a thesaurus can be stored entirely in memory. To introduce our method of storage and retrieval of a keyword in the "TABLE-THESAURUS", we describe very briefly what an arborescent file is.
4.1 Method of the arborescent file
It is a classic technique. Each character is connected to the next by a link. There is one character per byte; the other bytes are used as pointers (to the following or alternate character; a special mark indicates that it doesn't exist) or as an end of word (a * in the 4th character of the byte). Example: let's store COBOL, COCA, COCO.
[Figure: byte-level arborescent storage of COBOL, COCA and COCO.]
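The arborescent file just described behaves like a character trie; a minimal dictionary-based sketch, with * as the end-of-word mark as in the text:

```python
# Minimal sketch of the arborescent file: each character leads to the
# next through links, and '*' marks the end of a word (as in the text).

END = "*"

def insert(trie: dict, word: str) -> None:
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})   # follow or create the link
    node[END] = True                     # end-of-word mark

def contains(trie: dict, word: str) -> bool:
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = {}
for w in ("COBOL", "COCA", "COCO"):      # the example of the text
    insert(trie, w)
```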
This method has particular advantages in its use, and it avoids a file growing too greatly when a new word is inserted. The method however has notable inconveniences:
- the retrieval takes an appreciable time when we must determine whether a long word is present;
- the memory occupation is very important (it is 3 times as important as with any other method);
- the programming of it is rather complex.
More elaborate methods have been developed by Knuth, especially thanks to binary trees, but let us say that nothing has fundamentally changed (variable length, double links and a rather complex search procedure, ...)[17].

4.2 Elaboration of the "TABLE-THESAURUS"
Our method avoids all these inconveniences and improves on the qualities of an arborescent file because:
- the words are of fixed length, so we need no mark for the end of words;
- the words are very short.
These two points have enabled us to find an original structure. In fact, the file will be made up of 64 words of 64 bits each, and this twice (if we consider a reduction of a keyword to 3 characters).
We obtain a file at 2 levels, where the absence or presence of a bit indicates the presence or not of the character in the logical succession (the alternate pointer no longer exists). Example: let's store BAB.

[Figure: first-level and second-level bit tables storing BAB.]
Thus, at each level, for each character, a bit indicates the following character; so if we have prints of length q, the number of levels will be q - 1. If we store BAB and ABC, we shall have in memory:

[Figure: first-level and second-level bit tables storing BAB and ABC.]
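The two-level bit table can be sketched with one machine word of successor bits per character and per level; the 5-letter alphabet and the toy prints below are assumptions. The sketch also exhibits the "noise" analysed in Section 4.3: a print that was never stored can pass every transition test.

```python
# Sketch of the bit table: at each level, for each character, a row of
# bits marks which characters may follow it. The alphabet and words are
# toy assumptions. Because only successive-character transitions are
# kept, two stored prints can make a non-stored print appear present.

ALPHABET = "ABCDE"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def build(prints):
    q = len(prints[0])
    levels = [[0] * len(ALPHABET) for _ in range(q - 1)]  # q-1 levels
    for p in prints:
        for lvl in range(q - 1):
            # set the successor bit of p[lvl] at this level
            levels[lvl][IDX[p[lvl]]] |= 1 << IDX[p[lvl + 1]]
    return levels

def present(levels, p):
    return all((levels[lvl][IDX[p[lvl]]] >> IDX[p[lvl + 1]]) & 1
               for lvl in range(len(levels)))

levels = build(["ABC", "DBE"])
# ABE was never stored, but its transitions A->B and B->E both exist:
noise = present(levels, "ABE")
```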
Size of the "TABLE-THESAURUS". With a synonymity rate of 1/1000, this table will occupy (64 × 64) × 2/8 = 1024 bytes of memory, an important fact. But we can reduce this table while gaining more security against the synonyms. With q = 3 we have 64³ ≈ 260,000 possible combinations; or, if we take q = 4 but with a set of 30 monograms, we shall have 30⁴ = 810,000 possible combinations, so a rate of synonymity below 1/1000. In these conditions our "TABLE-THESAURUS" will occupy (30 × 30) × 3/8 ≈ 340 bytes in memory, whatever the number of keywords in a normal thesaurus, while attaining a very high reliability.
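The sizes quoted above follow from counting one bit per (character, successor) pair at each of the q - 1 levels:

```python
# Size of the table: (q-1) levels of c*c bits, i.e. c*c*(q-1)/8 bytes.

def table_bytes(c: int, q: int) -> float:
    """c monograms, prints of q characters -> (q-1) levels of c*c bits."""
    return c * c * (q - 1) / 8

size_64 = table_bytes(64, 3)    # 64 monograms, 3-character prints
size_30 = table_bytes(30, 4)    # 30 monograms, 4-character prints (~340 bytes)
combos_64 = 64 ** 3             # about 260,000 possible prints
combos_30 = 30 ** 4             # 810,000 possible prints
```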
4.3 Probability of "noise"
The "noise" in our system has two origins:
- the "noise" which comes from synonyms;
- the "noise" which comes from the structure of our "TABLE-THESAURUS".
We shan't come back to the first point. It is however interesting to develop the second, because if this "noise" is too important, it can render this table structure inadequate. Indeed, the "TABLE-THESAURUS" can produce keywords which in fact do not exist in the thesaurus, because keywords can have common endings. A classic calculation shows that the probability of a synonym in a "TABLE-THESAURUS" with three levels, for a corpus of 30 monograms, is about 1.1/1000, which is next to nothing.
This probability increases slightly if the compression rate decreases. For example, if we take a "print" of 5 characters, the noise increases by about 1/800,000. This noise becomes insignificant if we work with a print of 3 characters: it is then about 1/10,000.
4.4 Elaboration of a symmetrical "TABLE-THESAURUS"
This noise may however prove a serious drawback, and we can suppress the defect by building a table of the same size but symmetrical to the "TABLE-THESAURUS": for example, if we have ABCD, in the symmetrical table we shall store DCBA, which will remove any doubt as to the identity of the true keyword in the thesaurus. Suppose a keyword exists in the thesaurus and its print is ABCD. If we look for keywords whose prints are DACD and BCCD and which do not exist in the thesaurus, we shall nevertheless be able to find them (provided of course that other existing keywords begin by DA...).
If we use a symmetrical "TABLE-THESAURUS", this ambiguity is cancelled, because in the latter we shall only have the print of the true keyword, which is in this case DCBA (see the following picture).

[Figure: symmetrical tables distinguishing the existing keyword from the non-existing keywords.]

With this symmetrical "TABLE-THESAURUS" the rate of noise synonymity is then 1.21/10⁶ (1 in one million), thus next to nothing.
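The symmetrical table can be sketched by building a second transition table from the reversed prints and requiring a candidate to pass both. Transition sets stand in for bit rows here, and the example prints are assumptions; the two noise probabilities multiply, which is how an overall rate of the order of 1.21/10⁶ arises, though individual false positives may remain.

```python
# Sketch of the symmetrical TABLE-THESAURUS: a second table of the same
# size stores every print reversed (ABCD is stored as DCBA), and a
# candidate is accepted only if it passes both tables. Transition sets
# stand in for the bit rows; the example prints are assumptions.

def transitions(prints):
    """Set of (level, char, next_char) triples kept by the table."""
    return {(lvl, p[lvl], p[lvl + 1])
            for p in prints for lvl in range(len(p) - 1)}

def present(table, p):
    return all((lvl, p[lvl], p[lvl + 1]) in table
               for lvl in range(len(p) - 1))

def present_sym(forward, backward, p):
    # a candidate must pass the forward table AND, reversed, the
    # symmetrical table
    return present(forward, p) and present(backward, p[::-1])

prints = ["ABCD", "BCCD"]
forward = transitions(prints)
backward = transitions([p[::-1] for p in prints])
```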
5. CONCLUSION
Our function f of concentration, applied on the set of keywords of a thesaurus, permits us to draw remarkable results:
- whatever the length of a keyword may be, it can be reduced (with C = 30) to 4 characters with a very small synonymity risk of about 1/1000;
- this allows us to build an original table structure called "TABLE-THESAURUS", of less than 700 bytes, where we can store all the keywords of a thesaurus;
- the retrieval of a keyword (one test per 6 bits) is instantaneous.
This technique suppresses:
- the transfers of parts of the thesaurus between the memory and the peripherals;
- the tests on each character of a keyword as well as on the different pointers.
The management of this "TABLE-THESAURUS" raises no difficulty (no pointers, no word-end marks). We have applied this method[22], and the results have verified on all points what the laboratory experiments had shown. So, given that the system is queried in correct natural language, a work-instrument like the "TABLE-THESAURUS" is useful: a keyword or a thematic term may be recognized instantaneously. With such a considerable compression of a thesaurus, therefore, we can bring a very great improvement into the organization of the documentary system and thus considerably decrease its cost in use (on the storage and on the time of retrieval), which is now holding back the development of automatic documentation.
We can quote the following significant results, which have been experimented and which are in use[21]:
- possible reduction of the TITLES-AUTHORS of a book to 10 characters for a volume of 3 M books (it is obvious that for a small library the compression rate would be greater);
- reduction of surnames and Christian names of persons to 4 characters for a file containing 20,000 headings;
- this last experiment has led us to try our method on hash coding: we reach the creation of a quasi-unique function H, which allows us to keep all the available room in memory, for we have no more synonyms to manage;
- elaboration of a tool for comparing texts written in a natural language, different or otherwise, with application to the detection of young social misfits;
- formalization of a computer search through an "Information-Decision" model;
- construction of a function f' to retrieve the original information with a rate of compression of 70%.

REFERENCES
[1] R. W. Bemer, Do it by the numbers: digital shorthand. Comm. ACM 1960, 530-536.
[2] L. Bolliet, Notation et processus de traduction des langages symboliques. Thèse, Grenoble, pp. 412-422 (June 1967).
[3] C. Bourne and D. Ford, A study of methods for systematically abbreviating English words and names. J. ACM 1961.
[4] H. Corgier, Une méthode phonétique de recherche et de mise à jour sur fichiers multiples ou de masse. Travail et Méthodes 1970, 39-41.
[6] P. De Maine and A. Marron, Automatic data compression. Comm. ACM 1967, 711-715.
[7] A. Deweze, Établissement et exploitation automatique de fichiers de citations bibliographiques. Bull. UNESCO XVIII 1964, 101-108.
[8] J. Donio, Le projet AIDE. Paris, ANPE (Dec. 1973).
[9] R. Fajman and A. Borgelt, WYLBUR: an interactive text editing and remote job entry system. Comm. ACM 1973, 314-322.
[10] W. D. Frazer, Compression parsing of computer file data. First USA-Japan Computer Conference (1972).
[11] H. Gottlieb, A classification of compression methods and their usefulness for a large processing center. National Computer Conference (1975).
[12] C. Guérin and T. B., Étude du développement d'un système documentaire automatique pour l'information chimique en propriété industrielle: DIAPASON. Mémoire de fin d'études, INTD (Dec. 1974).
[13] W. Hagamen, Encoding verbal information as unique numbers. IBM Systems J. 4 (1973).
[14] B. Hahn, A new technique for compression and storage of data. Comm. ACM 1974, 434-436.
[15] J. H., Utilisation d'un calculateur en statistique. Thèse 3e cycle, Grenoble (June 1970).
[16] Lu Honien, A file management system for large corporate information system data banks. Fall Joint Computer Conference 1968, 33, 145-156.
[17] D. Knuth, The Art of Computer Programming, Vol. 3, Chap. 6. Addison-Wesley, New York (1973).
[18] G. Louis-Gavet, Étude mathématique pour la concentration de fichiers occupant un volume important: statistiques informationnelles. AFCET, pp. 101-111 (1971).
[19] G. Louis-Gavet, Étude mathématique pour la concentration de fichiers occupant un volume important (2ème partie): outil mathématique. AFCET, pp. 71-80 (1972).
[20] G. Louis-Gavet, Étude d'un algorithme pour réduire des fichiers. AFCET, pp. 17-30 (1973).
[21] G. Louis-Gavet, Compactage de données structurées: contribution à la conception d'un système d'informations composé de fichiers multiples et volumineux. Thèse, Lyon (June 1974).
[22] G. Louis-Gavet, Élaboration d'un système documentaire ayant comme langage d'interrogation le langage naturel. Rapport de contrat, Unité Hermétique, Lyon (June 1975).
[23] Multileaving: the HASP System. IBM Publ., pp. 1139-1153 (Feb. 1971).
[24] B. M. Oliver, Efficient coding. Bell System Tech. J. 1952, 31(4), 724-750.
[25] J. R., File ordering and retrieval cost. Information Storage and Retrieval. Pergamon Press, New York (1971).
[26] J. MacQueen, Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1 (1967).
[27] G. Salton, Automatic Information Organization and Retrieval, Chaps. I-III. McGraw-Hill, New York (1968).
[28] T. Savage, A note on the evaluation of methods for systematically abbreviating English words. Am. Docum. 1973.
[29] J. B., B. L. and A. L., Information retrieval with the ABACUS program. International Atomic Energy Agency, Vienna.