Microprocessing and Microprogramming 32 (1991) 497-504, North-Holland


PROLOG ON A RISC: IMPLEMENTATION AND EVALUATION

Gilles BERGER SABBATEL, Abderrazak JEMAI
TIM3/IMAG-INPG, 46 Av. F. Viallet - 38031 GRENOBLE CEDEX - FRANCE

ABSTRACT

This paper discusses the compilation of Prolog on a RISC processor. This is part of the Symbion project, which aims at designing the basic processing element for modular parallel symbolic machines. The compiler is based on a variant of the WAM. First, some results are presented with respect to the emulation of the abstract machine. Then, we comment on the code generation on a RISC processor (the MIPS), and give some results obtained from the execution of Prolog programs on a MIPS emulator. The performances obtained on a 20 MHz clocked MIPS are close to the performances of WAM-based Prolog machines.

1. INTRODUCTION

The design of Prolog machines has been a research topic for several years now. The first project was oriented toward the hardware implementation of a Prolog interpreter [YYT83]. Then, several projects dealt with the implementation of an intermediate language, the WAM (Warren's Abstract Machine) [NaN87, TiW84]. More recent projects are based on a RISC approach [Mil89, SeY87]. Most of these projects lead to special-purpose designs. The WAM is a very complex specialized machine: some instructions have an unbounded execution time and reference arbitrary memory space (dereference, unification), as they process complex data structures. RISC based designs often implement specialized features such as tag checking hardware or shadow registers. However, there is still an open question: is it worth designing special purpose architectures when the performances of general purpose processors improve so fast? Many studies show that the speedup obtained with a special architecture is insignificant, compared to the fastest processors on the market. With the evolution of software implementation techniques, special purpose features (type checking, dereference) become less attractive. Furthermore, specialized architectures can hardly be adapted to the evolutions of the language itself (such as constraint logic programming [Coh90]).

Hence, our feeling is that, rather than designing specialized architectures for Prolog, it would be useful to find what can be modified in general purpose processors to improve their efficiency for logic programming languages, without an excessive increase in their complexity, or a decrease of their generality. This led us to study the compilation of Prolog on a RISC processor: the MIPS. Other studies have been done on the compilation of Prolog on the MIPS [Tay90], and resulted in performances superior to Prolog machines. However, they were based on high level optimizations of the compilation. In our work, we took classical compiling techniques, and gave particular attention to the low level aspects of the compilation on the MIPS. This work is part of the Symbion project, which aims at designing a processing node for modular symbolic machines. We will give a brief presentation of it in the next section. Afterward, we will comment on the choice of the MIPS for this study, and finally give some results about the compilation of Prolog on this processor.

2. OVERVIEW OF THE SYMBION PROJECT

The goal of the project is to design the processing element for modular parallel symbolic machines - we call this element Symbion. Symbions can be connected to build networks, such as 3 dimensional regular arrays. Communication is achieved through high speed serial links, as in the Transputer, so that networks of any topology can be constructed.


The communication between the processes, and the garbage collection, should be integrated in a distributed virtual memory system. Or-parallel Prolog processes will share their environments by virtual copy, so that only useful data will actually be transmitted. The architecture of the Symbion is depicted in Figure 1. The basic elements are:
• A symbolic processor, which executes the programs.
• A local memory, which contains the basic subroutines for the processing.
• A memory processor, in charge of caching and address translations. We intend to design a cache and address translation system taking into account the specific behavior of Prolog programs.
• A global memory.
• A communication processor, in charge of transmission/reception and routing of messages.

[Figure 1. Architecture of Symbion]

So far, we have concentrated our research on the symbolic processor. We claim it could be possible to find a few optimizations to significantly improve the efficiency of standard RISC processors for symbolic languages such as Prolog, without an excessive increase in their complexity. Our first task was to implement an emulator of a variant of the WAM [Van84, War83], the TWAM (intended for parallel implementation on a Transputer network). By profiling this emulator, we can get a first idea of the most time consuming operations of the TWAM for various benchmarks. Then, a translator from the TWAM to MIPS assembly language was written, carefully optimized, and evaluated on a simulator of the MIPS.

3. CHOICE OF A RISC PROCESSOR

The characteristics of four RISC processors are summarized in table 1. The reader should refer to [Pat85] for a more complete discussion on RISCs. The architecture of the memory system is an important characteristic, especially for symbolic languages such as Prolog. On the MIPS, there is a multiplexed bus, so that two memory accesses (instruction + data) can be executed in a single cycle (with separate instruction and data caches).

On the Am29000, there are two buses for instructions and data, but a single multiplexed address bus. The M88000 has a Harvard architecture with two fully separate buses for instructions and data. Only the SPARC has a single non multiplexed bus, so that a data access needs at least one extra cycle.

Processor        SPARC   MIPS   M88K   Am29K
Instructions     76      74     130    112
Registers        136     32     32     192
Addr. modes      5       3      7      4
Instr. formats   6       3      7      6
Condition code   yes     no     no     no
Delayed loads    no      yes    no     no
Delayed branch   opt.    yes    opt.   yes
Window regs.     yes     no     no     stack
On chip MMU      no      yes    no     yes
Inst/data bus    1       1      2      1.5

TABLE 1. Characteristics of RISC processors

We chose the MIPS, as it is the simplest and most regular processor. The M88000 has interesting features such as bit field processing, but it is very complex. The Am29000 has an interesting register file (which can be used as a stack), but the only addressing mode for data is register indirect, which means that most addresses must be computed before the access.

Delayed loads are an original feature of the MIPS: when data is read from memory, it is loaded in the target register with a delay of one instruction, so the data will only be available for the second instruction following the load instruction. Any instruction not using the loaded data item can be put after the load. If no useful instruction can be found, a no-operation must be inserted. It must be noted that there is also a load delay with the other processors: it is due to the pipeline organization, in which the writeback of data read from the memory is performed after the fetch of the operands of the next instruction. Two other solutions exist:
• The processor pipeline can be suspended during the access (SPARC), which means that a load always needs one extra cycle.
• The loaded register can be locked until the memory access is completed (M88000, Am29000): if the instruction following the load uses the target register of the load as a source operand, it must wait for the completion of the memory access. Otherwise, it can be executed without a wait cycle.
Therefore, the only two drawbacks of delayed loads are the memory space wasted by the no-operation inserted after the load, when no useful instruction can be put there, and the waste of memory bandwidth for multiprocessor architectures.

Delayed branches are a feature common to most RISC processors: when a branch is executed, the instruction following the branch instruction is always executed. A no-operation can be put after the branch, but in most cases, a useful operation can


be placed there. Some processors (SPARC, M88000) have branch modes where the instruction following the branch is discarded if the branch is taken: in this case, there is still a one cycle delay, and the only advantage is to avoid the memory space and memory bandwidth wasted by the no-operations.

4. THE TWAM

The TWAM is a variation of the WAM, intended for a parallel implementation on a Transputer network, developed by J. Briat et al., from the LGI laboratory. The main differences with the WAM reside in specific instructions for optimizing particular cases, and some hooks for the implementation of OR-parallelism. We implemented a sequential emulator of the TWAM, and profiled it. The following benchmarks are used:
• nrev (nr), qsort (qs), deriv (der), query (que): the classical benchmarks from D. Warren [War77] (naive reverse, quick sort, derivation, data base query).
• farmer (far): a solver for the classical farmer problem.
• monkey (mon): a solver for the monkey and banana problem.

The characteristics of the benchmarks are summarized in tables 2a and 2b. The 2nd column of table 2b (mem) gives the average number of data memory accesses per inference. The code size is for a 32 bit encoded TWAM (every instruction is coded on one or several 32 bit words).

(a) Memory space (bytes)
Program   code   heap   stack   trail
nrev       524   3960    728       0
qsort      804   2660    208     904
query     1920     64    128      32
deriv     1804   1584    700     248
farmer    2620    780   1124      36
monkey    3552   1472   1356     348

(b) Operations executed
Program   Inferences   Mem.   TWAM instrs.
nrev         498        16        4100
qsort        387        52        5443
query        704        90       18875
deriv         75        44        1122
farmer       183        45        2884
monkey       204        28        2054

TABLE 2. Characteristics of the Prolog benchmarks

Table 3 presents the time spent by the emulator in the most time consuming operations (or families of operations). For the purpose of profiling, some simple operations (dereference, type checking, trail...) have been transformed into subroutines, which increases their execution time.

              nrev   qsort  deriv  query  farmer monkey   avg.
dereference   19.6   17.2   15.2   19.9   17.5   13.6    17.2
unify inst.   31.0   18.9    8.6    9.4   18.1   15.4    16.9
get inst.     16.9    9.7   10.5    7.9   10.4    7.6    10.5
type check     8.9    9.7    8.1   12.3    9.3    7.8     9.4
p_write        0.7    0.4   30.5    0.2    3.8    3.4     6.5
p_arith        0.0    9.2    0.5   21.1    0.0    0.0     5.1
switch inst.   6.4    2.7    4.3    6.1    3.8    6.0     4.9
put inst.      2.7    4.8    2.4    5.0    7.4    6.3     4.8
try inst.      0.0    6.3    6.2    0.0    4.4    6.1     4.2
backtrack      0.0    5.3    0.0    6.1    6.0    6.5     4.0
trail          4.6    3.6    3.3    5.7    1.4    2.6     3.5
unification    0.7    0.8    0.5    0.0    5.5    9.7     2.9
adr. tag       3.4    2.7    2.9    1.7    2.7    2.3     2.6
execute        3.9    2.1    1.0    0.0    0.6    2.1     1.6
escape         0.6    2.7    0.0    2.6    1.6    1.0     1.4
call           0.5    1.1    1.9    0.9    1.4    1.8     1.3
allocate       0.5    0.4    1.9    0.0    2.7    1.0     1.1
TOTAL         98.6   97.7   97.6   98.9   97.0   95.6    97.7

TABLE 3. Measures on the TWAM emulator (% of execution time)

As can be expected, a significant amount of time is spent in the dereference for every benchmark. Unify and get instructions are also very time consuming, but they do not execute real unifications, which are executed by the unification function. Only the more complex benchmarks (farmer and monkey) spend a significant amount of time in this function. Type checking is another significant operation. The write/1 predicate accounts for 6.5% of the processing time, but this is mainly due to the deriv benchmark, which spends 30.5% of its time in this predicate.

5. COMPILING PROLOG ON THE MIPS

For the compilation, we used the TWAM compiler to produce TWAM code, and developed a TWAM to MIPS assembly language translator. The execution model is basically the same as described in [War83]. The same algorithms as for the TWAM emulator have been used.


The type checking is implemented as follows: addresses are coded as positive values (bit 31 = 0), and the different address types are differentiated by their two low order bits. A reference address has its two low order bits equal to zero; list addresses and structure addresses have low order bits set to 10 and 01 respectively. Other types (integers, atoms,...) are coded with a tag in the 8 high order bits, bit 31 being 1. Hence, an address can be easily identified by a single sign test. However, a simpler and more uniform coding could be used.

The programs are executed on a MIPS emulator, which allows us to get some measurements, and will later allow us to evaluate possible modifications of the machine language.

A first version of the translator was implemented in 3 months. In this first version, TWAM registers are stored in memory. The instructions are fully generated in line. This version allowed us to run a few benchmarks (qsort and nrev) and to measure the TWAM register utilization. It ran at 140 Klips for the naive reverse, based on a 20 MHz clock. A second, more complete version was then carried out in 2 months. Its optimizations include the allocation of three TWAM registers to hardware MIPS registers, and the transformation of some instructions generated in line into subroutine calls. More complete and precise evaluations have been carried out. This version ran at 202 Klips for the naive reverse. Then, further optimizations were carried out, leading to the current version, which runs at 522 Klips. The total manpower for the current version, including evaluable predicates written in assembly language, is about 12 man-months. However, a new version for a different RISC processor would be developed much more quickly.

5.1 Register utilization

The TWAM register utilization has been evaluated by tracing the memory accesses in the first and second versions. Only partial results have been obtained from the first version, as it only allowed the execution of two benchmarks. The results are summarized in table 4.

Register    nr     qs     que    der    far    mon
h           3785   3739      ?      ?      ?      ?
s           2760   2388      ?      ?      ?      ?
e            308   3744  10362    397   2183    521
tr           934   1470   8977    390   1134    513
b             32   1808   1354    406   1731    564
es            94   1435   2357    225   1277    454
cp           128    986   4286    187   1020    352
X1          2362   2529   9656    418   2015    672
X2           152   1433   8328    237   1186    503
X3           930    980   3800    243    104    259
X4           870   1022   3800    155      ?    135

TABLE 4. Register utilization (number of memory accesses)

The results obtained from the first version clearly allowed the choice of h (heap pointer) and s (structure pointer) as the first registers to allocate to MIPS hardware registers. The choice of other registers needed further measurements on the second version. E (environment pointer) is clearly one of the most used registers, even though it is hardly used in nrev. B (backtrack point), tr (trail pointer) and X1 (first argument register) are also widely used. In the current version, registers have been allocated to:
• the heap pointer (h),
• the structure pointer (s),
• the environment pointer (e),
• the trail pointer (tr),
• the backtrack pointer (b),
• the continuation pointer (cp),
• the environment size (es),
• four argument registers (X1 to X4),
• the constant NIL,
• the constant 0x80000003 (tag mask),
• the heap limit pointer,
• the local stack limit pointer.

The two last registers are used for stack overflow checks.
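Keeping the limit pointers in hardware registers makes each overflow test a plain register compare and branch, with no extra memory traffic. The following C fragment is our own illustration of the idea, with hypothetical names; the paper does not show this code:

    /* h and heap_limit stand for the MIPS registers holding the
       heap pointer and the heap limit (hypothetical names). */
    extern unsigned long *h, *heap_limit;
    extern void heap_overflow(void);   /* hypothetical recovery routine */

    void check_heap(void)
    {
        if (h >= heap_limit)   /* one compare + one branch on the MIPS */
            heap_overflow();
    }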

5.2 Implementation issues

5.2.1 Subroutines

Some TWAM instructions are generated as subroutine calls in order to reduce the code size. For some instructions, this obviously increases the execution time, but in most cases, the time is reduced. This is due to several facts:
• Many instructions dereference their argument. When the instruction is generated in line, this is done by a call to the dereference subroutine. When the instruction is executed by a subroutine, the dereference is expanded in line, so that the overhead of calling a subroutine is saved. The same is true for other subroutines, such as trail, so that the global effect is to reduce the execution time.
• When an instruction generates a branch to the backtrack code, this requires 4 cycles: in fact, the conditional branch instruction has a target address specified by a 16 bit word offset, allowing displacements of ±128 Kbytes from the branch instruction. It can never be guaranteed that the backtrack subroutine lies in this address range, so we must generate an indirect jump:

    beq  $3, $4, success
    nop                  # branch delay (can be filled)
    j    backtrack
    nop                  # branch delay (can be filled)
success:                 # processing if success

To avoid this, the subroutines for the instructions which can cause backtracking are grouped in a single module, which is loaded close to the module containing the backtracking code. Hence, we can be sure that the offset will be less than 128K, and a single conditional branch to the backtrack code can be generated.

5.2.2 Dereference

The dereference appears as a basic operation in Prolog. A naive implementation of it in the MIPS assembly language could be the following:

           # the value to dereference is in register $4
    deref: bltz $4, end       # check if not pointer
           nop                # branch delay
           andi $1, $4, 0x3
           bne  $1, $0, end   # check if str/list ptr
           nop                # branch delay
    loop:  add  $3, $4, $0    # save previous value
           lw   $4, 0($4)     # dereference pointer
           nop                # load delay
           # a free variable is represented by an autoref
           beq  $3, $4, end   # check autoref
           nop                # branch delay
           bltz $4, end       # check if not pointer
           nop                # branch delay
           andi $1, $4, 0x3
           beq  $1, $0, loop  # loop if variable ptr.
           nop                # branch delay
    end:

In this version, the dereference takes 10 cycles for each loop. As the value 0x80000003 is stored in register $23, a single test can determine if the value is a variable pointer (if $4 & $23 yields a zero result). The AND of $4 and $23 can be put in a branch delay slot, as it does not destroy useful information. And the save of the previous value of $4 can be put in the load delay slot, as it is still available. So, we get the following program:

    deref: and  $1, $4, $23
           bne  $1, $0, end   # check if str/list ptr
           nop                # branch delay
    loop:  lw   $4, 0($4)     # dereference pointer
           add  $3, $4, $0    # save previous value
           beq  $3, $4, end   # check autoref
           and  $1, $4, $23
           beq  $1, $0, loop  # loop if variable ptr.
           nop                # branch delay
    end:

Now, the dereference loop only takes 6 cycles. In fact, the conditions for stopping the loop are: either we get an autoreference, or we get an invalid word pointer address (if we consider that addresses always have their higher order bits set to 0). So, the dereference could be executed in only 3 cycles per loop, if we had an instruction which loads data only when the address is valid, and otherwise is a no-operation. If we call lwc (load word conditional) such an instruction, the code would be the following:

    deref: lwc  $4, 0($4)     # dereference, if address valid
    loop:  add  $3, $4, $0    # save previous value
           bne  $3, $4, loop  # loop if value changed
           lwc  $4, 0($4)     # duplicate load to fill branch delay
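For a higher level view, the tag scheme described above and this loop can be modeled in C. The sketch below is ours, not code from the paper: the type and helper names are invented, and the pointer cast assumes the 32 bit MIPS address space. Note that guarding the load with the tag test is exactly what the proposed lwc instruction would do in hardware:

    #include <stdint.h>

    typedef uint32_t cell;            /* one 32 bit TWAM data word       */

    #define TAG_MASK 0x80000003u      /* bit 31 + the two low order bits */

    /* Addresses are positive (bit 31 = 0); the two low order bits select
       the pointer kind: 00 = reference, 10 = list, 01 = structure. Other
       types carry a tag in the 8 high order bits, with bit 31 = 1.      */
    static int is_ref_ptr(cell w)    { return (w & TAG_MASK) == 0u; }
    static int is_list_ptr(cell w)   { return (w & TAG_MASK) == 2u; }
    static int is_struct_ptr(cell w) { return (w & TAG_MASK) == 1u; }

    /* Follow reference pointers until reaching a non-reference word or
       an unbound variable, represented as a pointer to itself (autoref). */
    static cell deref(cell w)
    {
        while (is_ref_ptr(w)) {
            cell prev = w;
            w = *(const cell *)(uintptr_t)w;   /* follow the pointer     */
            if (w == prev)                     /* autoref: unbound var   */
                break;
        }
        return w;
    }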

5.2.3 Other optimizations

Other optimizations have also been implemented:
• programming optimizations,
• optimization of some sequences of instructions: when a value loaded in a register by an instruction is needed by the next instruction, it is not loaded again: this is the case for proceed or execute instructions following a deallocate,
• when the first argument is dereferenced by a switch instruction, the dereferenced value is kept in a register and can be used by a get instruction,
• optimization of unify sequences: a special code is generated after a put_structure or put_list (in write mode). After a get, we avoid the mode test by generating first the write part of the unify operations, and then the read part [Tur86].

5.3 Performances

The performances of the current version are shown in tables 5a and 5b. The MIPS code size is given in bytes. The third column of table 5a (size, M/W) gives the ratio between the MIPS code size and the 32 bit encoded TWAM code size. The third column of table 5b (cycles, M/W) gives the average number of cycles per TWAM instruction. The performance is 522 Klips for the naive reverse benchmark.

(a) Code size
Program   MIPS   M/W
nrev      1788   3.4
qsort     2952   3.7
query     3648   1.9
deriv     6240   3.4
farmer    7384   2.8
monkey    9316   2.6

(b) Execution
Program   cycles   M/W    loads   stores   nops
nrev       19091    4.6    1617    1642     187
qsort      41225    7.5    4452    6604    1087
query     283590   15.0   38316   17564   14007
deriv       9783    8.7     837    1697     343
farmer     29945   10.4    3542    2553    2514
monkey     24893   12.1    2862    2709    2005

TABLE 5. Performances of the current version
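As a cross-check, the ratios in table 5 can be reproduced from tables 2a and 2b: for nrev, the code size ratio is 1788/524 ≈ 3.4, and its 4100 TWAM instructions execute in 19091 cycles, i.e. 19091/4100 ≈ 4.6 cycles per TWAM instruction. The Klips figure follows the same way: at a 20 MHz clock, 498 inferences in 19091 cycles give 498 × 20 000 000 / 19091 ≈ 522 000 lips, the 522 Klips quoted above.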


With respect to the first versions, the performances have been greatly increased, in terms of both speed and memory size. The MIPS/TWAM code size ratio is remarkably uniform and appears acceptable. On the other hand, the number of MIPS cycles per TWAM instruction varies widely, which can easily be understood if we consider the differences in the behavior of these programs (table 3). The load/store ratio is close to 1 in most cases, and less than 1 in 2 cases: this is a classical behavior in Prolog, where, due to backtracking, data are often written and then discarded without being accessed. The number of no-operations executed varies from 1% to 8.4% (average = 4.7%). This version includes stack overflow checks. Without these, the performance rate would be improved by at least 5%: one possible solution to avoid these checks could be to integrate them into the virtual memory management system. The sizes of the common modules are the following:

instructions calling backtrack:    1216 bytes
other instructions:                 556 bytes
primitives (unification, etc.):    1168 bytes
arithmetic predicates:             1384 bytes
other evaluable predicates:        5852 bytes
Total:                             4728 bytes

A small set of evaluable predicates, including write/1 and is/2, has been implemented, in order to allow the execution of significant benchmarks. 15 instructions are implemented as subroutine calls; the others are generated in line.

5.4 Distribution of the processing time

Table 6 gives the distribution of the processing time between the user program, the subroutines implementing TWAM instructions, the basic subroutines (dereference, unify...), and the arithmetic evaluation function. These results have been obtained through a trace of the instruction accesses done by the MIPS emulator.

Module        nrev   qs     der    que    far    mon
user          64.3   40.0   41.5   24.2   30.0   25.9
instructions  32.4   17.8   40.2   23.2   22.9   12.9
primitives     3.3   32.0   17.4   14.6   30.4   48.2
arithmetic     0.0   10.1    0.8   38.0    0.0    0.0

TABLE 6. Distribution of the processing time per module

A first remark is that most programs spend less than 42% of their time in the main program. Nrev, which spends 64.3% of its time in the main program, is a special case: it only uses basic features of the Prolog engine. So, at least 58% of the processing time is spent in a set of subroutines whose total size is at most 10 Kbytes. Hence, the use of a fast local memory to store these subroutines could be a good choice for a Prolog architecture: this would significantly reduce the common memory access rate, and improve the instruction cache efficiency.

Table 7 gives the distribution of the processing time between the most time consuming subroutines. The dereference does not appear: in fact, most dereferences are expanded in line in other functions, so that its use cannot be easily evaluated. The unification subroutine is only important for farmer and monkey (as in table 3, with a slight increase in percentage). The weight of save_state and backtrack has also slightly increased.

Module        nr     qs     der    que    far    mon
allocate      2.2    1.7    2.8    0.1    3.2    0.9
switch_un.    0.0    0.0    8.4   13.4    1.9    2.0
get_const.    0.0    0.0    2.5    7.1    2.2    0.2
get_list     27.6   10.7    0.0    0.0    3.4    6.4
get_nil       1.2    3.9    0.0    0.0    0.0    0.0
get_struct.   0.0    0.0   24.7    0.0    6.0    1.6
save_state    0.0   15.2   10.7    0.2    6.0    8.0
trail         1.3    2.7    2.2    5.2    1.3    3.7
backtrack     0.0   11.3    0.8    9.2    8.5    8.9
unification   2.0    2.8    2.7    0.0   13.7   27.6

TABLE 7. Distribution of the processing time per subroutine

Table 8 gives the distribution of the processing time per MIPS instruction family. The percentages are relative to the useful instructions (without nops).

Program   arithmetic   lui   logic   excl. or   shifts   load/store   jumps   cond. br.
nrev         32.5      3.0   10.8      0.7       0.2        17.2       11.7     24.0
qsort        28.6      1.5    5.8      2.9       3.7        27.5       10.5     19.3
deriv        32.6      3.3    6.3      2.2       0.2        26.8       10.6     17.8
query        19.4      2.7    8.0      7.6       7.3        21.6       10.3     23.1
farmer       25.9      3.4    8.4      5.8       0.8        23.1       13.6     18.9
monkey       27.6      3.2    8.4      3.9       1.4        24.9       10.3     20.3

TABLE 8. Distribution of the processing time per MIPS instruction family

The exclusive or instruction has been separated from the other logical instructions, as it is only used for register to register transfers and immediate loads (although the same operations are also performed through arithmetic operations). The lui instruction (load upper immediate) loads the high order half of a register with an immediate value: it is generally used to load the tag part of a constant. Logic and shift instructions are most often used for tag processing. Arithmetic instructions are mainly used for address calculation and comparison. The branch operations (conditional branches and jumps) are very time consuming. However, the branch delays appear to be efficiently filled, as the nop ratio is relatively low. A few MIPS instruction families are not used:
• set on less than, intended to implement register to register comparison (we used subtractions),
• shifts with a variable shift amount,
• unaligned load and store operations (lwl, lwr, swl, swr),
• conditional branch and link.

5.5 Discussion


With our compilation technique, a performance rate of 522 Klips has been obtained. This could be further improved by:
• optimizations at the Prolog compiler level (mode and type inferences, etc.),
• optimizations at the assembly level: the generated code is optimized at the WAM instruction level, but further optimizations could be done by a global reorganization across the instruction boundaries: branch and load delay filling, removal of redundant operations...
• simple hardware support.

These performances should be compared with the performances of Prolog machines. KCM [TNB90] seems to be one of the fastest current WAM-based Prolog machines; it runs at 760 Klips for the naive reverse benchmark. IPP [ABY87] has higher performances, but is based on an ECL processor, and should be compared with an ECL version of the MIPS. So, our implementation runs at 69% of the performance of KCM, at a much lower price: KCM is a complex 64 bit backend processor, while the MIPS is a standard 32 bit standalone processor. If we consider that this implementation can still be optimized, the advantage of specialized Prolog machines seems to be minimal.

The machine language of a RISC processor such as the MIPS appears well suited to the implementation of logic programming languages. Most WAM registers can be allocated to MIPS registers, which greatly reduces the number of memory accesses (the most time consuming family of operations): a large part of the performance improvement between the first and the current version is due to the reduction of memory accesses obtained through the use of registers. The load and branch delays do not really appear to be an issue. We could not evaluate precisely the percentage of time spent in tag processing, as it is distributed between several kinds of operations. In fact, it appears that most tag operations can be reduced to very simple sequences of MIPS instructions (most often 1 or 2 instructions), so that the use of specialized hardware for tag processing does not appear to be very interesting. However, the load of an immediate value often requires 2 instructions, due to the tag inserted in the high order bits of the data words. This could be simpler if the tag were placed in the low order bits (this solution will be evaluated in the future; the sketch below illustrates the point). The code size expansion ratio, with respect to a 32 bit encoded WAM, appears quite reasonable (1.9 to 4.0) and will be improved in the next versions. With respect to KCM, the average code-size ratio is about 1.8.
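To make the immediate load issue concrete: MIPS immediates are 16 bits wide, so any constant with non-zero high order bits must be built with a lui/ori pair. The C macros below are our own illustration, with assumed tag values rather than the paper's exact encodings:

    #include <stdint.h>

    /* With the tag in the 8 high order bits (bit 31 = 1), even a small
       integer has non-zero upper bits (e.g. 0x81000005 for 5 under an
       assumed integer tag of 0x81), so the MIPS needs two instructions
       to load it (lui for the upper half, then ori for the lower half). */
    #define INT_TAG_HIGH(v)  (0x81000000u | ((uint32_t)(v) & 0xffffffu))

    /* With a low order tag (assumed value 3, unused by the pointer
       codes 00/10/01), the same integer encodes as a small value,
       e.g. (5 << 2) | 3 = 0x17, which fits the 16 bit immediate of a
       single instruction.                                              */
    #define INT_TAG_LOW(v)   (((uint32_t)(v) << 2) | 0x3u)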

6. CONCLUSION

In this paper, we presented an implementation of Prolog on a RISC processor. The use of a simulator allowed us to present very fine measurements of the behavior of the programs. The performance of our implementation is 522 Klips for the naive reverse, and can still be improved. Our main goal in this study is to find modifications of a RISC machine language which bring a significant improvement to the performances of logic programming languages. It appears now that such modifications will be at a very low level, rather than at the level of WAM operations. The next study will evaluate the relevance of hardware modifications, along with the implementation of other software optimizations. The behavior of the processes in terms of memory accesses (cache, paging) will also be evaluated. These evaluations will require the execution of larger benchmarks, requiring the implementation of a more complete set of evaluable predicates.

Acknowledgements

We would like to thank Jacques Briat and Claudio Geyer, both from the LGI laboratory, for their support of the Prolog compiler we used, and the help they gave us for the implementation of the TWAM emulator. The Symbion project is a joint research project of the TIM3 and LGI laboratories.

References

[ABY87] S. Abe, T. Bandoh, S. Yamaguchi, K. Kurosawa and K. Kiriyama, High performance integrated Prolog processor IPP, 14th Int. Symp. on Computer Architecture, Pittsburgh, June 1987, 100-107.

[Coh90] J. Cohen, Constraint logic programming languages, Comm. ACM 33, 7 (July 1990), 52-68.

[Mil89] J. W. Mills, A high performance LOW RISC machine for logic programming, Journal of Logic Programming 6, 1 (Jan. 1989), 179-212.

[NaN87] H. Nakashima and K. Nakajima, Hardware architecture of the Sequential Inference Machine: PSI-II, Tech. Rep. 265, ICOT, Tokyo, June 1987.

[Pat85] D. A. Patterson, Reduced instruction set computers, Comm. ACM 28, 1 (Jan. 1985), 8-21.

[SeY87] K. Seo and T. Yokota, PEGASUS: A RISC processor for high performance execution of Prolog programs, VLSI 87, 1987.

[Tay90] A. Taylor, LIPS on a MIPS: Results from a Prolog compiler for a RISC, Seventh International Conference on Logic Programming, D. H. D. Warren and P. Szeredi (eds.), Jerusalem, June 1990, 174-185.

[TNB90] O. Thibault, J. Noyé and H. Benker, KCM: une machine Prolog, 2ème Symposium Architectures Nouvelles de Machines, Toulouse, Sep. 1990, 311-332.



[TiW84] E. Tick and D. H. D. Warren, Towards a pipelined Prolog processor, Int. Symp. on Logic Programming, Atlantic City, Feb. 1984, 29-40.

[Tur86] A. K. Turk, Compiler optimizations for the WAM, 3rd Int. Conf. on Logic Programming, E. Shapiro (ed.), Springer Verlag, London, July 1986, 657-662.

[Van84] P. Van Roy, A Prolog compiler for the PLM, RR 84/203, University of California, Berkeley, Computer Science Division, Berkeley, Nov. 1984.

[War77] D. H. D. Warren, Implementing Prolog - Compiling predicate logic programs, Edinburgh University, D.A.I. RR 39-40, May 1977.

[War83] D. H. D. Warren, An abstract Prolog instruction set, Tech. Rep. 309, SRI International, Menlo Park, Oct. 1983.

[YYT83] M. Yokota, A. Yamamoto, K. Taki, H. Nishikawa and S. Uchida, The design and implementation of a personal sequential inference machine: PSI, New Generation Computing 1 (1983), 125-144.