A mathematical formulation of keyword compression for thesauri

GUY LOUIS-GAVET
Université de Lyon I, Chargé d'enseignement, Mas des Berlandières, Pte de Beaurepaire, Balbins, 38-260 La Côte-Saint-André, France

Abstract-In this paper we demonstrate a new method for concentrating the set of keywords of a thesaurus. This method is based on a mathematical study that we have carried out into the distribution of characters in a defined natural language. We have built a function f of concentration which generates only a few synonyms. In applying this function to the set of keywords of a thesaurus, we obtain fixed-length "prints" which can all be stored in memory in a table called "TABLE-THESAURUS".
1. INTRODUCTION: CONTEXT OF THE PROBLEM

In a documentary system, the main problem is less to bring an innovating solution into the structure, or to link together the different files, than to compress data. The three key-words of a documentary system are "Information, Storage, Retrieval". To reach the information with the greatest pertinence in a given field, we have to use a language which is in fact a metalanguage; it allows the various equations called "documentary" to be solved. Such a language is represented in the documentary system by the thesaurus: the set of key-words (or concepts) linked together by all the semantic and syntactic relations which are possible. Very simply, we shall say that to answer a specific question, we have to look up in the thesaurus the concepts raised by the question.

The situation is most difficult at the storage level of the thesaurus. Keywords have different lengths (sometimes as many as 70 characters in the chemical thesaurus), resulting in:
-a rather substantial volume;
-an important execution time while searching for a keyword.
In order to have a really documentary system, it would be advisable to put the most important part of the thesaurus in memory.

It is interesting to note that computer scientists have always tried to compress the data given in the files. This is all the truer since nowadays we witness a growing expansion of their volume, which consequently renders storage and search costs exorbitant. The different research devoted to this delicate problem has aimed at drawing an advantage from:
-data with unique characteristics that might be compressed (spaces, logical links between one another, probabilities of occurrence, ...); and
-codes whose representation is greater than the number of characters which are ordinarily used (ASCII: 128 possible symbols; EBCDIC: 256 possible symbols).
In the first case we have noted the experiments carried out by OLIVER[24], HONIEN-LU[16], FAJMAN[9], SALTON[27], MARRON[6], CULLUM[5], HAHN[14] and the
HASP system[23]; in the second case: BEMER[1], DONIO[8], CORGIER[4], DEWEZE[7]. GOTTLIEB[11] and FRAZER[10] give a good classification of these methods. The operation of these methods is more or less complex, and this often leads to a difficult application (for example, in the SOLID system[6], 7 steps are required to compress the data). It must be noted that those which are presently applied only reach rates of compression inferior to 0.5 (4 or 5 bits per character) for texts written in natural language; the most performant seems to be that of Hahn: 3-4 bits/char[14].

More particularly, to compress the keywords of a thesaurus, note the following experiments. Deweze, on the thesaurus "Économie et gestion des entreprises"[7], has tried to produce unique arrangements of 5 characters in the keywords, by studying those which were the most significant for definite keywords. At this level it is less a study of the structure of the keywords than an abbreviation, as is proposed by SAVAGE[28] or BOURNE and FORD[3]. However we reach a

very high rate of compression only by losing some parts of the keywords: in those techniques of abbreviation, synonymity is permitted (e.g. oxyde, ether, ...), and it is no longer possible to find the original keyword. On the other hand, the codes built on the natural structure of a language have so far hardly been usable on the keywords of a thesaurus; the ABACUS system[29], for example, reaches a compression rate of around 1/4.

Consequently we have followed another path. We have carried out a study of the distribution of characters in a defined natural language; this mathematical structure has allowed us to build a function f of compression which, applied to the set of keywords of a thesaurus, generates only very few synonyms. With such a rate of compression we can put the whole original thesaurus in memory: we can create a table called "TABLE-THESAURUS", in which the retrieval of a keyword is done in a few tests.

In the following sections we shall put forward: the mathematical approach to the problem, based on the distribution of monograms, bigrams or trigrams in natural languages (Section 2); the structure of the "print" of a keyword, built with the function f of compression (Section 3). In Section 4 we shall show that such a compression rate has permitted us to store the whole set of keywords of a thesaurus in memory.

2. MATHEMATICAL APPROACH TO THE PROBLEM

2.1 Definitions
Consider the sets T and E and a function f:
-T = {t1, t2, ..., ti, ..., tn}, a finite set of elements representing the input data of a file, written in a natural language. The elements t of the set T are strings of characters σ1σ2...σi...σL, with σi ∈ {letters, figures, spaces, special characters}, of variable length L (L finite).
-E = {e1, e2, ..., ei, ..., en}, a finite set representing the output data or "prints" of a file. The elements e of the set E are strings of characters e = v1v2...vi...vq, with vi ∈ {letters, figures, spaces, special characters}, of fixed length q (q fixed).
-f, a function of compression defined as follows: f: T → E, such that ti ≠ tj ⇒ f(ti) ≠ f(tj).
The problem can be resumed in the following schema:

Input data: set T = {t1, ..., tn} (FILE)  --f-->  Output data: set E = {e1, ..., en} (FILE)
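In modern terms, the requirement on f can be sketched as follows. The toy extraction rule below is purely illustrative (the paper's actual function f is built later, from the statistics of the language); what matters is the fixed length q of the prints and the injectivity condition, i.e. the absence of synonyms.

```python
# Sketch (not the paper's algorithm): a compression function f mapping
# variable-length keywords (set T) to fixed-length q-character "prints"
# (set E). The requirement is injectivity on T: t_i != t_j => f(t_i) != f(t_j).

def f(keyword: str, q: int = 4) -> str:
    """Toy print: keep the first q non-space characters, pad with '*'."""
    chars = [c for c in keyword.upper() if c != " "]
    return "".join(chars[:q]).ljust(q, "*")

def is_injective_on(keywords) -> bool:
    """Check that f produces no synonyms (collisions) on this set."""
    prints = [f(k) for k in keywords]
    return len(set(prints)) == len(prints)

T = ["GESTION", "BENEFICE", "ACHAT", "COUT"]
assert all(len(f(k)) == 4 for k in T)            # fixed length q
assert is_injective_on(T)                        # no synonyms on this toy set
assert not is_injective_on(["COUT", "COUTURE"])  # the toy rule collides here
```

The last assertion shows why a naive truncation rule is not enough: the whole point of the paper is to choose f so that such collisions become extremely rare.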

In order to standardize, a set of definitions has been constituted according to Backus notation[2]:

(monogram) ::= (character)
(character) ::= (letter) | (space) | (delimiter) | (rare character)
(letter) ::= A | B | C | ... | Z
(delimiter) ::= (figure) | (orthographic sign) | (punctuation mark) | (special character)
(figure) ::= 0 | 1 | 2 | ... | 9
(orthographic sign) ::= ' (apostrophe) | - (hyphen)
(punctuation mark) ::= , | ; | : | ! | ? | .
(special character) ::= " | + | %
(rare character) ::= * (asterisk, end of an element t)
(digram) ::= (monogram)(monogram)
(trigram) ::= (monogram)(monogram)(monogram)

This table will be completed by these definitions:

(identifier) ::= element t of the set T
(cardinal) ::= number of elements of a set
(redundancy) ::= identity of monogram, digram, trigram

For our problem we will consider a statistical structure (Ω, A, p):
-Ω, the set of characters representing the set of identifiers;
-A, a set of parts of Ω;
-p, a set of laws of probability.

2.2 Theorem-corollary

Let t = {σ1, σ2, ..., σx, ..., σp} be an input string representing an identifier. The succession of monograms σi reflects a structure defining the language in which t is written. The problem is to know whether the σi are distributed according to a law. Our first step will be to consider a monogram as independent and as a random variable in a classic statistical law. But this can only be a rough approximation (especially because of the double monograms); that is why we have been led to consider several approximations in succession, by taking the (x-uples) of monograms {σi}, {σi, σi+1}, ..., {σi, σi+1, ..., σx} as independent elements. We prove that an adequation test[15] with the binomial law is verified progressively better as x increases.
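The adequation (goodness-of-fit) idea can be sketched as follows. The chi-square form of the statistic and the toy corpus are assumptions for illustration; the paper does not give its exact test statistic.

```python
# Sketch of an "adequation" test of monogram counts against a binomial law.
# For a chosen monogram, compare the observed distribution of its per-word
# counts with Binomial(n, p_hat), via a chi-square-style statistic.
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def chi_square_vs_binomial(corpus, monogram):
    """Chi-square-style distance between observed counts and the binomial law."""
    n = max(len(w) for w in corpus)          # positions per word (padded corpus)
    counts = [w.count(monogram) for w in corpus]
    p_hat = sum(counts) / (n * len(corpus))  # estimated occurrence probability
    stat = 0.0
    for k in range(n + 1):
        expected = len(corpus) * binomial_pmf(k, n, p_hat)
        observed = counts.count(k)
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy corpus, padded with '*' to a common length n = 8 (an assumption).
corpus = ["GESTION*", "BENEFICE", "RETRAIT*", "ECHEANCE"]
assert chi_square_vs_binomial(corpus, "E") >= 0.0
```

A small statistic means the binomial law fits well; in the paper's terms, the fit improves as the independent units move from monograms to digrams and trigrams.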

In this first part we shall introduce the following theorem.

Theorem. Let t be an input string representing an identifier: t = {σ1, σ2, ..., σi, ..., σn} (with σi ∈ {letters, figures, spaces, special characters}), arranged in an order which defines the identifier language, and let the (x-uples) be couples of monograms. The adequation test with the binomial law must apply with a threshold α of signification, and x ∈ {1, ..., n} is a parameter depending on the language, giving the independence of an (x-uple) with the following monogram.

Corollary. Let:
-Ω be the set of monograms representing the set of the identifiers;
-E be the set of the (x-uples) satisfying the adequation tests with the binomial law, and N its number;
-Ē be the complementary set, and M its number.
Then [M]/[N] → 0 as x grows (with a threshold of signification α).
[Table: adequation tests. For each monogram (A, B, C, ..., Z), the threshold alpha for an acceptation of the test.]

[Table (cont'd): adequation tests, digrams grouped by significative alpha, from alpha = 0.9 (e.g. AC, AY, BI, BL, BO, BR, CI, CL, FR, GA, GU, ID, JA, LU, OP, OR, RE, RR, ...) down to alpha = 0.3 (e.g. AR, AV, AP, BU, CU, DI, FO, OI, RI, VA).]
3. STRUCTURE OF THE "PRINT" OF A KEYWORD[19]

3.1 Criteria defining the construction of the "print"
For the method to be efficient, the "print" must have the following characteristics:
-its length must be fixed and much inferior to the average length of an identifier (q ≪ L).

Example: THE DOCUMENTARY INFORMATICS. A print on 4 characters will give: OTNT.

An exception to this rule exists when the length of the keyword is inferior or equal to q: in this case, the "print" will be the keyword itself.
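The paper does not spell out the character-selection rule here. The following sketch uses one hypothetical rule (take every (L // q)-th character of the keyword) which happens to reproduce the example above; the real algorithm, refined in Section 3.2, is statistical.

```python
# Sketch of a print-extraction rule consistent with the paper's example
# ("THE DOCUMENTARY INFORMATICS" -> OTNT with q = 4). The sampling rule
# (every (L // q)-th character) is an assumption, not the paper's algorithm.

def make_print(keyword: str, q: int = 4) -> str:
    if len(keyword) <= q:        # exception: short keywords are their own print
        return keyword
    step = len(keyword) // q
    # 1-based positions step, 2*step, 3*step, ... truncated to q characters.
    return keyword[step - 1::step][:q]

assert make_print("THE DOCUMENTARY INFORMATICS") == "OTNT"
assert make_print("COUT") == "COUT"   # length <= q: the print is the keyword
```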

3.2 Improvements of the algorithm
The statistics that we have developed a priori indicate the risk of synonymy of two "prints"[8], but do not indicate the possible defects of the structure of the monograms. Just as we have a fixed print length, we might have particular cases at the level of the distribution of the same monograms (or digrams, trigrams) in the same position of the print (defects which would come from the structure of the keywords). In these conditions, wouldn't it be superfluous or insufficient to take a string of x monograms as a print? Thus, it is at the level of the algorithm which builds the print that we have had to avoid the defects of distributing monograms in the keyword. This implies that we have had to develop different statistics at the level of the print, particularly in its entirety, because the natural language is in fact an estimation of an ergodic system, but we shall have on no account such an approximation for the set of "prints". (It is obvious that it must be the contrary: in fact, to have the maximum of combinations, the monograms must be distributed randomly in the set of prints.) Let us take a simple example. We have noted that the vowels appear at a rate of 34% in secondary position in the prints. We managed to lower this "rise" by improving the algorithm thanks to the statistical theory developed on the set of prints. This proved sufficient at first. But very soon we saw that we were no longer controlling the other defects which were developing in other respects at the level of the print. So, as the print has a fixed length, we thought of a statistical theory which connects the position of the monogram to the monogram itself. We have thus created a mathematical implement, which enables the distribution of monograms in each position of the print to be known for a set of keywords. This is summed up in the following schema:

[Schema: the statistics developed on the set of keywords and on the set of prints both feed back into the construction of a good algorithm.]

We have not presented a complete list of all the improvements that we can make on our algorithm. They are numerous, and we have certainly missed a few. So we shall only quote some of them, the most important, that we have experimented on in several different sets.

First improvement: This first improvement will be immediate and will consist in avoiding, in the algorithm, the particular zones of the element e of the set E. Let us take two examples:
-50% of titles of books begin with an article; thus there is a strong probability of the monograms "l", "e", "o", ... in this zone.
-in the names and the keywords, the monogram "space" is always in the same zone.

Second improvement: By this algorithm, we cannot have, in the set of the prints, monograms which are not to be found in the set of identifiers. To avoid this inconvenience, which was diminishing the potential number of monograms, we have given each a "weight".

(a) Notion of weight, first form. This notion implies that a monogram may only appear in the string of characters of a print a limited number of times. Let n be this number. We substitute, for the (n + 1, ..., n + m) redundancies of this monogram, rare or special monograms (?, !, -, ...). Each monogram is linked to a specific list of monograms. Example for A:
? will be the (n + 1)th
! will be the (n + 2)th
With this method the following print: ABJAAA will become: ABJ?!-. The method is particularly interesting if the "print" is long.
Remark: To avoid a too strong redundancy in certain zones (for example the "-" in 6th position), rules will be used to apply this method to the beginning of the print. For example, as 50% of titles of books begin with an article, the method in this case will apply to the beginning of the print: the print AAREAA will become -!RE?A.
(b) Notion of weight, second form. This notion is linked to the frequencies of a monogram in the set of the identifiers and to their positions in the prints (see the following table).
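The first form of the weight rule can be sketched directly from the ABJAAA example. Only A's substitution list ('?', '!', '-') is taken from the paper; the table of lists for the other monograms is not given, so the dictionary below is a placeholder.

```python
# Sketch of the "weight, first form" rule: a monogram may appear at most n
# times in a print; further redundancies are replaced by rare monograms from
# a list specific to that monogram. Only A's list comes from the paper.

SUBSTITUTES = {"A": ["?", "!", "-"]}   # hypothetical table, one entry known

def apply_weight(print_str: str, n: int = 1) -> str:
    seen = {}
    out = []
    for c in print_str:
        seen[c] = seen.get(c, 0) + 1
        if seen[c] > n and c in SUBSTITUTES:
            # (n+1)th occurrence -> first substitute, (n+2)th -> second, ...
            out.append(SUBSTITUTES[c][seen[c] - n - 1])
        else:
            out.append(c)
    return "".join(out)

assert apply_weight("ABJAAA") == "ABJ?!-"   # the paper's example
```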

[Table: weight, second form. For each monogram (space, A, E, I, S, R, B, F, M, L, T, C, Y, ...), its frequency in the set of identifiers and its admissible positions in the prints.]

This method is applied on short prints. To sum up, these improvements have a triple advantage:
-to operate on the interval of redundancy of monograms so that they are all equal;
-to introduce monograms which do not belong to the set of identifiers;
-to create a balance between all the monograms used in the set of identifiers.

3.3 Application
In applying our function f, with the following hypotheses and the second improvement, to the set of keywords: Card(T) = 3000; C = 64 (cardinal of the set of monograms); q = 4; e = {v1v2v3v4}; the step ≈ L/3.

Examples of keywords: CONTROLEUR * ENTREPRISE * GESTION FINANCIERE * INTEGRATION * PLANIFICATION * BENEFICE * CORRELATION * ACHAT * REDONDANCE * COMPARATEUR ENERGETIQUE * GAIN MENSUEL * TAUX HORAIRE * INVESTISSEMENT PLANIFIE * RETRAIT * TRAITEMENT * ABATTEMENT DE ZONE * INDEMNITE JOURNALIERE * PONDERATION DES PRIX * PLANIFICATION DES COUTS * RUPTURE DE STOCK * ECHEANCE * CIRCUIT FINANCIER * FINANCEMENT DES ENTREPRISES * RATIONALISATION DES CHOIX BUDGETAIRES * ANALYSE ECONOMIQUE * ECONOMIE DE L'INFORMATIQUE * AUTOFINANCEMENT * BANQUE DE DONNEES * COMPTABILITE ANALYTIQUE * COMPTE D'EXPLOITATION * CONTROLE DE GESTION * COUT * OPERATION FINANCIERE * PRIX DE REVIENT *

We have not had synonyms (C^q = 64^4 ≈ 16.5 M possibilities of random "prints"). We have tried for q = 3, e = {v1v2v3}, and we have obtained a rate of synonymity of 1/1000 (1 per thousand), which is negligible.

4. ELABORATION OF A "TABLE-THESAURUS"

With such a rate of compression, we demonstrate that a thesaurus must be stored entirely in memory. To introduce our method of storage and of retrieval of a keyword in the "TABLE-THESAURUS", we describe very briefly what an arborescent file is.

4.1 Method of the arborescent file
It is a classic technique: each character is connected by a link. There is one character per byte; the other bytes are used as pointers (to the following or alternate character; a special mark indicates that it doesn't exist) or as an end-of-word mark (a * in the 4th character of the byte). Example: let us store COBOL, COCA, COCO.

[Schema: the arborescent file storing COBOL, COCA and COCO, character by character with pointer bytes.]
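The arborescent file just described can be sketched with a nested dictionary standing in for the byte-and-pointer layout (a modern simplification, not the paper's memory representation); '*' marks an end of word, as in the paper.

```python
# Sketch of the classic arborescent file (a trie): one node per character,
# with '*' as the end-of-word mark.

def insert(trie: dict, word: str) -> None:
    node = trie
    for c in word:
        node = node.setdefault(c, {})   # follow or create the character link
    node["*"] = True                    # end-of-word mark

def contains(trie: dict, word: str) -> bool:
    node = trie
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return "*" in node

trie = {}
for w in ("COBOL", "COCA", "COCO"):     # the paper's example words
    insert(trie, w)

assert contains(trie, "COCA") and contains(trie, "COBOL")
assert not contains(trie, "COB")        # a prefix only, no end mark
```

The drawbacks listed next (variable-length walks, pointer overhead) are exactly what this structure exhibits: each lookup follows one link per character.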

This method has particular advantages in its use, and it avoids a file being too greatly increased when there is insertion of a new word. The method however has notable inconveniences:
-the retrieval takes an appreciable time when we must determine if it is a long word;
-the rate of memory occupation is very important (it is 3 times as important as in any other method);
-the operation of the programming is rather complex.
More elaborate methods have been developed by Knuth, especially thanks to binary trees, but let us say that nothing has fundamentally changed (variable length, double link and a rather complex research procedure, ...)[17].

4.2 Elaboration of the "TABLE-THESAURUS"
Our method avoids all these inconveniences and improves the qualities of an arborescent file because:
-the words are fixed: we have no mark for the end of words;
-the words are very short in length.
These two points have enabled us to find an original structure. In fact, the file will be made up of 64 words of 64 bits each, and this twice (if we consider a reduction of a keyword on 3 characters).

We obtain a file at two levels, where the absence or presence of a bit indicates the presence or not of the character in the logical succession (the alternate pointer no longer exists). Example: let us store BAB.

[Schema: the two-level bit table after storing BAB.]

Thus, at each level, for each character, the bit indicates the following character; so if we have prints of length q, the number of levels will be (q - 1). If we store BAB and ABC, we shall have in memory:

[Schema: the two-level bit table after storing BAB and ABC.]
Size of the "TABLE-THESAURUS"
With a synonymity rate of 1/1000, this table will occupy (64 × 64) × 2/8 ≈ 1200 bytes of memory, an important fact. But we can reduce this table, leading to more security against the synonyms. With q = 3 we have 64^3 ≈ 260,000 possible combinations; or if we take q = 4 but with a set of 30 monograms, we shall have 30^4 = 810,000 possible combinations, so a rate of synonymity below 1/1000. In these conditions our "TABLE-THESAURUS" will occupy (30 × 30) × 3/8 ≈ 340 bytes in memory, whatever the number of keywords in a normal thesaurus, while attaining a very important reliability.
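The level structure can be sketched as follows. The bit matrices mirror the description above ((q - 1) levels, one C × C matrix per level); the tiny 5-character alphabet is an assumption for the demonstration.

```python
# Sketch of the "TABLE-THESAURUS": for prints of length q over an alphabet of
# C monograms, (q - 1) levels of C x C bit matrices. At level k, bit [a][b]
# says that monogram b follows monogram a at position k in some stored print.

def make_table(alphabet: str, q: int):
    C = len(alphabet)
    return [[[0] * C for _ in range(C)] for _ in range(q - 1)]

def store(table, alphabet, print_str):
    for k in range(len(print_str) - 1):
        a, b = alphabet.index(print_str[k]), alphabet.index(print_str[k + 1])
        table[k][a][b] = 1

def lookup(table, alphabet, print_str):
    return all(
        table[k][alphabet.index(print_str[k])][alphabet.index(print_str[k + 1])]
        for k in range(len(print_str) - 1)
    )

ALPHABET = "ABCDE"
table = make_table(ALPHABET, q=3)
for p in ("ABC", "DBE"):
    store(table, ALPHABET, p)

assert lookup(table, ALPHABET, "ABC")
assert not lookup(table, ALPHABET, "ACB")
assert lookup(table, ALPHABET, "ABE")   # "noise": shared middle gives a false positive
```

With C = 64 and q = 3 this is 2 levels of 64 × 64 bits, i.e. about 1 K bytes, of the order of the size computed above; the last assertion illustrates the "noise" analysed in Section 4.3.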

4.3 Probability of "noise"
The "noise" in our system has two origins:
-the "noise" which comes from synonyms;
-the "noise" which comes from the structure of our "TABLE-THESAURUS".
We shan't come back to the first point; it is however interesting to develop the second, because if this "noise" is too important, it can render this table structure inadequate. In this way, the "TABLE-THESAURUS" can produce keywords which in fact do not exist in the thesaurus, because keywords can have common endings. A classic calculation shows that the probability of a synonym in a "TABLE-THESAURUS" with three levels, for a corpus of 30 monograms, is about 1.1/1000, which is next to nothing.

This probability increases slightly if the compression rate decreases. For example, if we take a "print" of 5 characters, the noise increases by about 1/800,000. This noise becomes insignificant if we work with a print of 3 characters: it is then about 1/10,000.
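The orders of magnitude quoted above can be checked with a back-of-envelope calculation, under the assumption that prints behave like uniform random draws from the C^q possibilities (a birthday-problem approximation, not the paper's exact calculation).

```python
# Expected synonym rate when K keywords are mapped to random prints drawn
# uniformly from C**q possibilities (birthday approximation; an assumption).

def expected_collision_rate(K: int, C: int, q: int) -> float:
    M = C ** q
    expected_pairs = K * (K - 1) / (2 * M)   # expected number of colliding pairs
    return expected_pairs / K                # collisions per keyword

# 3000 keywords, 64 monograms, prints of 4 characters: ~16.8 M possibilities,
# so collisions are very unlikely, consistent with Section 3.3.
assert 64 ** 4 == 16_777_216
assert expected_collision_rate(3000, 64, 4) < 1 / 1000

# With 30 monograms and prints of 4 characters (810,000 possibilities), the
# rate rises but stays of the order of a few per thousand.
assert 30 ** 4 == 810_000
assert expected_collision_rate(3000, 30, 4) < 5 / 1000
```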

4.4 Elaboration of a symmetrical "TABLE-THESAURUS"
This noise may however prove a serious drawback, and we must suppress the defect by building a table of the same size, but symmetrical to the "TABLE-THESAURUS". For example, if we have ABCD, in the symmetrical table we shall store DCBA, which will remove any doubt as to the identity of the true keyword in the thesaurus. Suppose a keyword exists in the thesaurus and its print is ABCD. If we look for keywords whose prints are DACD or BCCD and which do not exist in the thesaurus, we shall nevertheless be able to find them (provided of course that other existing keywords begin by DA).

If we use a symmetrical "TABLE-THESAURUS", this ambiguity is cancelled, because in the latter we shall only have the print of the true keyword, which is in this case DCBA (see the following picture).

[Picture: lookup of an existing keyword and of a non-existing keyword in the symmetrical table.]

With this symmetrical "TABLE-THESAURUS" the rate of noise synonymity is then 1.21/10^6 (one in a million), thus next to nothing.

5. CONCLUSION

Our function f of concentration, applied on the set of keywords of a thesaurus, permits us to draw remarkable results:
-whatever the length of a keyword may be, it can be reduced (C = 30) to 4 characters with a very small synonymity risk of about 1/1000;
-this allows us to build an original table structure called "TABLE-THESAURUS", of less than 700 (seven hundred) bytes, where we can store all the keywords of a thesaurus;
-the time of retrieval of a keyword (one test for 6 bits) is instantaneous.
This technique suppresses:
-the transfers of parts of the thesaurus between the memory and the peripherals;
-the tests on each character of a keyword, as well as on the different pointers.
The management of this "TABLE-THESAURUS" raises no difficulty (no pointers, no word-end marks). We have applied this method[22], and the results have verified on all points what the laboratory experiments had shown. So, given that the query to the system is correct in natural language, such a work-instrument as the "TABLE-THESAURUS" is useful: a keyword or a thematic term may be recognized instantaneously. With such a considerable compression of a thesaurus, therefore, we can bring a very great improvement into the organization of the documentary system and thus considerably decrease its cost in use (on the storage and on the time of retrieval), which is now holding back the development of automatic documentation.


We can quote the following significant results, which have been experimented and which are used[21]:
-possible reduction of the TITLES-AUTHORS of a book on 10 characters, for a volume of 3 M books (it is obvious that for a small library the compression rate would be greater);
-reduction of the names-Christian names of persons on 4 characters, for a file containing 20,000 headings;
-this last experiment has led us to try our method on hash coding: we reach the creation of a quasi-unique function H, which allows us to keep all the available room in the memory, for we have no more synonyms to manage;
-elaboration of a tool for comparing texts written in a natural language, different or otherwise, with application to the detection of young social misfits;
-formalization of a computer research through an "Information-Decision" model;
-construction of a function f' to retrieve the original information, with a rate of compression of 70%.

REFERENCES

[1] R. W. Bemer, Do it by the numbers: digital shorthand. Comm. ACM 1960, 530-536.
[2] L. Bolliet, Notation et processus de traduction des langages symboliques. Thèse, Grenoble, pp. 412-422 (June 1967).
[3] C. Bourne and D. Ford, A study of methods for systematically abbreviating English words and names. J. ACM 1961.
[4] H. Corgier, Une méthode phonétique de recherche et de mise à jour sur fichiers multiples ou de masse. Travail et Méthodes (1970).
[6] D. De Maine and A. Marron, Automatic data compression. Comm. ACM 1967, 711-715.
[7] A. Deweze, Établissement et exploitation automatique de fichiers de citations bibliographiques. Bull. UNESCO XVIII (1964), 101-108.
[8] J. Donio, Le projet AIDE. Paris, A.N.P.E. (Dec. 1973).
[9] R. Fajman and A. Borgelt, WYLBUR: an interactive text editing and remote job entry system. Comm. ACM 1973, 314-322.
[10] W. D. Frazer, Compression parsing of computer file data. First USA-Japan Computer Conference (1972).
[11] H. Gottlieb, A classification of compression methods and their usefulness for a large processing center. National Computer Conference (1975).
[12] Étude du développement d'un système documentaire automatique pour l'information chimique en propriété industrielle: DIAPASON. Mémoire de fin d'études, INTD (Dec. 1974).
[13] W. Hagamen, Encoding verbal information as unique numbers. IBM Systems J. 4 (1973).
[14] B. Hahn, A new technique for compression and storage of data. Comm. ACM 1974, 434-436.
[15] Utilisation d'un calculateur en statistique. Thèse 3e cycle, Grenoble (June 1970).
[16] Honien Lu, A file management system for a large corporate information system data bank. Fall Joint Computer Conference 1968, 33, 145-156.
[17] D. Knuth, The Art of Computer Programming, Vol. 3, Chap. 6. Addison-Wesley, New York (1973).
[18] G. Louis-Gavet, Étude mathématique pour la concentration de fichiers occupant un volume important: statistiques informationnelles. AFCET, pp. 101-111 (1971).
[19] G. Louis-Gavet, Étude mathématique pour la concentration de fichiers occupant un volume important (2ème partie): outil mathématique. AFCET, pp. 71-80 (1972).
[20] G. Louis-Gavet, Étude d'un algorithme pour réduire des fichiers. AFCET, pp. 17-30 (1973).
[21] G. Louis-Gavet, Compactage de données structurées: contribution à la conception d'un système d'informations composé de fichiers multiples et volumineux. Thèse, Lyon (June 1974).
[22] G. Louis-Gavet, Élaboration d'un système documentaire ayant comme langage d'interrogation le langage naturel. Rapport de contrat, Lyon (June 1975).
[23] Multileaving: the HASP system. IBM Publ., pp. 1139-1153 (Feb. 1971).
[24] B. M. Oliver, Efficient coding. Bell System Tech. J. 1952, 31(4), 724-750.
[25] J. Rettenmayer, File ordering and retrieval cost. Information Storage and Retrieval. Pergamon Press, New York (1971).
[26] J. MacQueen, Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. I (1967).
[27] G. Salton, Automatic Information Organization and Retrieval, Chaps. I-III. McGraw-Hill, New York (1968).
[28] T. Savage, A note on the evaluation of methods for systematically abbreviating English words. Am. Docum. 1973.
[29] Information retrieval with the ABACUS program. International Atomic Energy Agency, Vienna (1972).