A language coprocessor as a HLL directed architecture


North-Holland Microprocessing and Microprogramming 24 (1988) 701-708


A LANGUAGE COPROCESSOR AS A HLL DIRECTED ARCHITECTURE

E.H. Debaere

Electronics Laboratory, State University of Ghent
Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium

High level language directed architectures (HLLDA) offer a good compromise between execution speed, program representation size and architectural complexity. Often such architectures consist of dedicated hardware and are heavily oriented towards one or more High Level Languages. Hence, the resulting computer is expensive and lacks flexibility. This paper presents another approach towards HLLDA's that eliminates these disadvantages: a computer consisting of a standard microprocessor environment equipped with a coprocessor designed to enhance the interpretive process. The paper presents the idea, compares it with related work and discusses possible optimizations.

1. INTRODUCTION

Since the first use of high level languages (HLL) for computer programming, different approaches to support the execution of HLL programs have been proposed. References go back as early as 1967 [1]. Milutinovic et al. have categorized HLL supporting architectures into four main classes [2]: HLL Directed Architectures (HLLDA), HLL Architectures - Type A, HLL Architectures - Type B and Direct Execution Architectures (DEA). Table 1 summarizes the basic features of these classes.

Table 1. Classification of HLL supporting machines

| Class | Compilation | Correspondence HLL statements - machine statements |
|---|---|---|
| General Purpose Machine | by software | one to many |
| HLL Directed Architecture | by software | one to a few |
| HLL Architecture - Type A | by software | one to one |
| HLL Architecture - Type B | by hardware | one to one |
| Direct Execution Architecture | none | no machine code |

In contrast to these HLL architectures, general purpose machines (GPM) are inherently not strongly oriented towards a particular HLL. But, because of their low cost and large flexibility, GPM's are mostly preferred when not restricted by specific application requirements. To directly support a HLL on a GPM, a (software) compilation step towards the native machine language is necessary and the resulting code density may be rather low. In contrast, DEA's [3] require no compilation phase and hence no code is generated. However, in this case the consultation of the HLL source may be cumbersome and the hardware required may be complex and expensive. Furthermore, as Hoevel illustrates in [4], a compilation step is desirable when combined execution-time/representation-size criteria are considered. Usually, an interpretive technique is used to bridge the gap between an intermediate representation and the machine level at a low cost, but this technique often exhibits a significant slowdown.

A promising compromise is an HLLDA, as illustrated in [5] and [6]. A software compilation phase into an intermediate level representation reduces the program size and increases the correspondence between the representation and the machine, which in turn favors execution speed [7]. Quite a number of HLLDA's have been developed. Examples are Lilith for Modula-2 [8], Symbolics for LISP [9], PEK for Prolog [10], Deltran for Fortran [11], Adept for Pascal [12] and the CHILL machine for CHILL [13]. Some of the HLL directed architectures provide multilingual support, others are oriented towards a single language [14]. These HLLDA's trade hardware cost against flexibility and complexity against speed.

In this paper we present and analyse an HLLDA consisting of a standard microprocessor (CPU), in order to reduce cost and increase flexibility, and a hardware device (called 'a language coprocessor' (LCP)) performing the interpretive process, to reduce the interpretive overhead. Section 2 presents this approach while section 3 compares the LCP/CPU concept with related work. Afterwards, some possible optimizations for the LCP/CPU architecture are presented (section 4) and concrete material is given to illustrate the feasibility of our approach (section 5). The concluding section (section 6) summarizes the features of our approach.

2. THE LCP/CPU APPROACH

Interpretive systems use a program (the interpreter) to consult a program representation and to derive and execute the corresponding semantic operations. Using a notation for interpretive systems proposed by Hoffmann [15], we denote a machine M (i.e. a programmable functional unit with known functional behaviour) executing instructions m of program I as

[Notation diagram not reproduced.]

and a machine M executing instructions m from an interpreter program I, which in turn interprets instructions i from a program P, as

[Notation diagram not reproduced.]

Microcoded machines M in interpretive systems may have (an) extra inner interpreter(s) which interpret the instructions m. These are interpretive systems with a level higher than one [15]. For the further discussion we only consider level-one interpretive systems, such as those mentioned above, and we do not distinguish between hard-wired and microcoded M-machines predefined by the manufacturer. To fulfil our objectives of low cost and high flexibility we opt for standard microprocessors as executing machines M. From now on M (or CPU) also stands for 'microprocessor'. This choice is a heavy constraint imposed on our HLLDA approach: both the instruction set m and the architecture M are fixed.

2.1. The interpretive process

For simple intermediate instructions (P-code [16], M-code [8]) the semantics of the interpreter program I are very simple. The interpreter I repeatedly fetches an instruction i and decodes it. The m-instructions semantically corresponding to the instruction i are then executed. The fetch and decode operations contribute to the interpretive overhead and take up a considerable amount of CPU resources (registers, time, flags). In fact the interpreter program I is very special: it inherently knows the type of its input data (i) and its output is one out of a set of predefined sequences of m-instructions. As its functional behaviour is well defined and, moreover, the i's and m-series are rather fixed, simple and low in number, it seems like an invitation to implement this functional behaviour in an extra functional unit L.

2.2. A processor for the interpretive process

The device L described above is a functional unit which executes an interpretive program C (we assume L to be programmable) that is semantically very similar to I.

[Notation diagram not reproduced.]

Its inputs are the instructions i of list P and its outputs are the sequences of m-instructions corresponding to each incoming i. Provided a communication path for passing the resulting m-sequences to M can be found, the interpretive execution of program P by the (L,M) machine is obtained.

The advantages obtained so far are:

(i) CPU resources allocated for the interpretive process are released and become available to execute the m-sequences. This is especially important if the CPU has only a few registers or if its instruction set is not well suited to the I-process. The L-machine now fully manages the I-process.

(ii) A considerable speedup of the I-process may be achieved, depending on the performance of the L-machine. The simplicity of the I-process is reflected in a low complexity of the L-machine.

Furthermore, an extra overall speedup is obtained by running both machines M and L in parallel. Since in general the fetching and decoding of i-instructions is independent of the execution of the m-instructions, both processes may be performed simultaneously. Only the order of the i's and m-sequences must be obeyed.
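The division of labour described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's design: the opcode names, the expansion table `M_SEQUENCES` and the 8086-style mnemonics are hypothetical and chosen only to show L fetching and decoding i-instructions while M consumes the resulting m-sequences in order.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical i-instructions and the m-sequences they expand to.
# Opcode names and expansions are invented for illustration only.
M_SEQUENCES: Dict[str, List[str]] = {
    "PUSH": ["mov ax, {arg}", "push ax"],
    "ADD": ["pop bx", "pop ax", "add ax, bx", "push ax"],
    "HALT": ["hlt"],
}


def l_machine(program: List[Tuple[str, int]]) -> Iterator[str]:
    """Fetch and decode each i-instruction and emit its m-sequence in program order."""
    for opcode, arg in program:            # fetch i
        template = M_SEQUENCES[opcode]     # decode i by table lookup
        for m in template:                 # generate the m-sequence
            yield m.format(arg=arg)        # embed the literal operand, if any


def m_machine(stream: Iterator[str]) -> List[str]:
    """Stand-in for the CPU: consumes m-instructions in the order they arrive."""
    return [m for m in stream]


program = [("PUSH", 2), ("PUSH", 3), ("ADD", 0), ("HALT", 0)]
trace = m_machine(l_machine(program))
```

The generator only models the ordering constraint: each m-sequence is produced on demand and in program order. It does not model true parallel operation of L and M.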


2.3. An LCP/CPU configuration

We still face the problem of finding a fast communication path between L and M. Since the architecture of M is fixed, and is in fact a standard microprocessor architecture, the m-instructions must be passed through the CPU's instruction bus. A conventional way to do so is to write the m-sequences into the CPU's instruction memory and activate the CPU at the start of the instruction list. However, in this case, the instruction bus would be allocated twice for each instruction, once for generation and once for execution (unless true dual port memory is used).

In standard microprocessor environments a better configuration is to locate the generation directly in the CPU's instruction path [17]. The L-machine then acts like a memory and generates m-instructions on demand of the CPU. Hence, for each instruction the instruction bus is occupied only once. (We do not consider dissimilarities between instruction length and instruction bus width.) The L-machine fetches the i-instructions (input data) from the CPU's memory, decodes them internally and generates the corresponding m-sequences when M fetches instructions.

In conventional Von Neumann architectures, as microprocessor systems usually are, both machines L and M access the same memory. This similarity with CPU systems equipped with DMA coprocessors inspired us to call the L-machine a 'Language Coprocessor' (LCP). Fig. 1 illustrates the described LCP/CPU configuration.

Although the concept of separate units for execution and for fetching and decoding of instructions is not new ([18],[19]), the application of this concept in a standard microprocessor environment is, as far as we know. Very similar are the Instruction Fetch Units (IFU) in dedicated HLLDA's. The next section discusses similarities with some published case studies.
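The following Python sketch illustrates the configuration described above, under stated assumptions: the class name `LanguageCoprocessor`, the opcode values and the `EXPANSIONS` table are hypothetical. It models an LCP that looks like instruction memory to the CPU, and counts instruction-bus transfers for the two configurations discussed (store the m-sequences first and then execute them, versus generating them in the instruction path).

```python
from typing import Dict, List

# Hypothetical i-code and expansions; names are illustrative, not from the paper.
EXPANSIONS: Dict[int, List[str]] = {
    0x01: ["m_load", "m_push"],
    0x02: ["m_pop", "m_pop", "m_add", "m_push"],
}


class LanguageCoprocessor:
    """Looks like instruction memory to the CPU: each fetch returns the next m-instruction."""

    def __init__(self, memory: List[int]) -> None:
        self.memory = memory          # shared memory holding the i-instructions
        self.pc = 0                   # i-level program counter kept inside the LCP
        self.pending: List[str] = []  # m-instructions decoded but not yet fetched
        self.data_reads = 0           # i-instruction reads performed by the LCP

    def fetch(self) -> str:
        """Called on every CPU instruction fetch."""
        if not self.pending:
            opcode = self.memory[self.pc]   # the LCP reads the i-instruction itself
            self.pc += 1
            self.data_reads += 1
            self.pending = list(EXPANSIONS[opcode])
        return self.pending.pop(0)


def bus_transfers_store_then_run(n_m: int) -> int:
    """Each m-instruction crosses the instruction bus twice: written, then fetched."""
    return 2 * n_m


def bus_transfers_in_path(n_m: int) -> int:
    """Each m-instruction crosses the instruction bus once, when the CPU fetches it."""
    return n_m


lcp = LanguageCoprocessor([0x01, 0x01, 0x02])
fetched = [lcp.fetch() for _ in range(8)]
```

The sketch ignores timing, bus width and what happens when the i-program is exhausted; a further `fetch()` call here would raise an `IndexError`.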

3. RELATED APPROACHES

The following comparison only deals with design decisions. Since the contexts of language, technology and design goals are largely specific to each case, we will not compare the HLLDA's regarding their performance and complexity. We restrict the discussion to the following three cases: (1) the IFU of the Dorado [18]; (2) the IFU of the G-machine [19]; (3) the Prolog Preprocessor [20].

Fig. 1. Diagram of an LCP/CPU configuration (block labels recoverable from the figure: Decoding, Execution).

3.1. The Instruction Fetch Unit of the Dorado

The Dorado [18] is a powerful personal computer built of ECL circuits and is fully microprogrammable. To provide its processor with instructions at a maximum rate of one instruction every 60 ns, the computer incorporates a special instruction fetch unit. Table 2 summarizes some important similarities and dissimilarities between the Dorado's IFU and our LCP/CPU approach. The table clearly illustrates the difference in the techniques used to solve the same problems involved with a decoupled fetch/decode and execute architecture.

The dissimilarities are mainly due to the different types of executing processors. Our approach presumes a standard predefined microprocessor as executing unit while the Dorado, as a dedicated computer, exploits all its degrees of freedom to speed up the interpretive process. In fact, the only task of a fetch/decode unit is to consult the intermediate instructions i of list P and to pass the information necessary to perform the corresponding semantic operations to the execution unit. As the Dorado's processor is microprogrammable and contains the microcoded interpreter routines m, a suitable form of communication is indeed the start address of the routine corresponding to an i-instruction. In this way the Dorado is a good illustration of the concepts outlined in [4] by Hoevel.

In contrast, as a consequence of the predefined execution unit in our LCP/CPU approach, the communication between the two processors consists of the m-sequences themselves and not of the start addresses. Passing start addresses to a standard microprocessor would require an extra low-level interpreter to perform the memory indirect jumps. This would reduce the performance significantly, since the time to perform the jump is often comparable to the time needed to execute the m-sequence proper.


Table 2. Comparison of design decisions in the Dorado, the G-machine, the Prolog Preprocessor and the LCP/CPU approach.

| Case | Exec. Unit (EU) | Fetch/decode Unit | Interpr. Process | Operand Passing |
|---|---|---|---|---|
| DORADO | dedicated microprogrammable processor | programmable, separate IFU | fetches intermediate code; decodes by table lookup; generates start address via separate bus to EU | via dedicated communication busses |
| G-MACHINE | dedicated RISC-like processor | programmable, separate IFU | fetches G-code; decodes by hardware; generates instructions via separate bus to EU | via shared addr/data bus |
| PROLOG PREPROCESSOR | microprocessor | programmable, separate preprocessor | intermediate code starts sequencer which controls logic; generates instructions via instruction bus to EU | embedded in instruction stream |
| LCP/CPU | microprocessor | programmable coprocessor | fetches intermediate code; decodes by table lookup; generates instructions via instruction bus to EU | embedded in instruction stream |

The method used for passing literal operands from the i-level to the m-level is also different in both designs. The Dorado has dedicated communication busses, while in our approach the operands are embedded into the instructions to be generated. This passing of operands from the i-level to the m-level is typical for interpretive systems and also contributes to the interpretive overhead. As a consequence of the strategies chosen, the Dorado still needs instructions (and hardware) in the interpreter routines to pass these operands from the communication bus to the registers. In our approach the operands are fully part of the instruction stream and in this sense the interpretive overhead in the LCP/CPU case is less than in the Dorado case.

3.2. Instruction Fetch Unit of the G-machine

The G-machine [19] provides hardware support for functional languages based on the graph reduction technique. Its central processor is a RISC-like architecture and requires a high instruction bandwidth. An intermediate language (G-code) is used to represent the functional program and an IFU fetches and translates the G-code, i, into RISC-processor instructions m. Table 2 summarizes the main features of this IFU.

The dedicated RISC-like processor resembles the Berkeley RISC [21] and is fed with instructions instead of start or dispatch addresses as in the Dorado. As with RISC's, the processor requires a high instruction throughput and the IFU acts as a fast instruction generator. In order to be able to generate one instruction per cycle, the IFU contains a hardware decoding translation scheme, branch instruction support and different buffers to support the pipeline structure. The generated code is linearized: unconditional jumps are removed and executed at the IFU stage. The code enters the processor via a separate control bus. The literals are passed through a shared address/data bus to the processor. This strongly hardware supported processor achieves a considerable execution speed. However, the environment is strictly oriented towards G-code.

3.3. The Prolog Preprocessor

The Prolog preprocessor [20] speeds up the execution of Warren's code, which is an intermediate representation of Prolog programs. It consists of a standard CPU (M68000) and a preprocessor. In contrast to both previous designs it uses a standard microprocessor as executing unit, which brings it close to our LCP/CPU architecture. The preprocessor internally contains the Warren code representation of a Prolog program. In fact, this code acts as a high-level microcode which controls the operations of the preprocessor: synthesizing M68000 instructions, monitoring data references performed by the M68000, controlling internal machines (sequencer, trail chip and comparator chip), etc. This strategy of keeping i-instructions in the coprocessor is the main difference with our approach. Our LCP acts as a fetch unit which fetches intermediate instructions from the CPU's memory space and translates them into an instruction sequence. The source of the


i-sequences can be any of the possible sources of the CPU. The Prolog Preprocessor expects its code to reside internally and hence achieves a higher performance (no bus cycles), at the cost of generality.

As a conclusion to this section, we note that our LCP/CPU approach contains features of the three discussed cases. It has the generality of the Dorado's IFU, it generates instruction sequences like the IFU of the G-machine and it fits in a standard microprocessor environment like the Prolog Preprocessor. We will now discuss some possible optimizations of the basic LCP/CPU configuration.

4. POSSIBLE OPTIMIZATIONS

Although the degrees of freedom are heavily restricted by the use of a standard microprocessor, as we have illustrated above, there are still some optimizations possible that improve our approach.

(i) As the LCP fetches i-instructions (or data on sequential addresses), the techniques used to reduce the number of instruction memory references in contemporary computers are also applicable. These techniques are caches, prefetch queues and branch target lookup tables. The compactness of the intermediate instructions [22] reduces the complexity of such optimizations and the enhanced spatial locality of the intermediate code favors hit rates [23]. A cache is used in both the Dorado and the G-machine. To illustrate this point, in a simulation environment we have observed a hit rate of 90% in a 256-byte direct mapped cache for M-code [24].

(ii) The general concept of the LCP permits its application to a broad range of HLL's. It suffices to change the internal program (C) of the LCP to support another language. Of course, this multilingual support is only applicable to truly similar HLL's. The Dorado illustrates this functionality well.

(iii) The Programmable Adaptive Instruction Generation technique described in [25] adapts the instruction stream to data values intercepted during memory references by the CPU. Hence, an extra speedup is obtained by eliminating communication between the instruction generation unit and the execution unit. This technique is used extensively in the Prolog Preprocessor.

(iv) On the whole, the LCP reduces the loss in execution speed that the interpretive technique incurs for the sake of a compact representation (intermediate language). We can therefore go further in decreasing the size, at the cost of LCP complexity. Since the LCP architecture can be changed, it is worth investigating how far we can go in this way. For M-code (8-bit opcodes) we computed that a modified Huffman encoding with the opcode length limited to 9 bits would reduce the representation size by almost 17%, while the decoding LCP only requires a small amount of extra logic. It is less cumbersome to support such a decoding algorithm in an LCP than in a software interpreter.
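To make optimization (i) concrete, the sketch below computes the hit rate of a direct-mapped cache over a trace of byte addresses. It is an illustration under assumptions, not the simulation reported in [24]: the 4-byte line size, the function name `direct_mapped_hit_rate` and the synthetic loop trace are all hypothetical, and the 90% figure quoted above is not reproduced here.

```python
from typing import Iterable, List, Optional


def direct_mapped_hit_rate(addresses: Iterable[int], cache_bytes: int = 256,
                           line_bytes: int = 4) -> float:
    """Return the hit rate of a direct-mapped cache over a byte-address trace."""
    n_lines = cache_bytes // line_bytes
    tags: List[Optional[int]] = [None] * n_lines
    hits = total = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block holding this byte
        index = block % n_lines        # the single cache line this block may occupy
        tag = block // n_lines         # identifies which block currently sits there
        total += 1
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag          # miss: replace the line
    return hits / total if total else 0.0


# Synthetic trace (assumption): a 64-byte loop body fetched byte by byte, executed 10 times.
trace = [0x1000 + offset for _ in range(10) for offset in range(64)]
rate = direct_mapped_hit_rate(trace)
```

For this synthetic trace only the first pass through the loop misses (16 line fills out of 640 fetches), so the loop body stays resident. Real intermediate code traces with calls and branches would show conflict misses that this example does not exercise.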

5. EXPERIMENTAL RESULTS

The LCP/CPU concept has been analyzed at our laboratory using two testbeds: the interpretive execution of Modula-2 programs and of Forth programs. In both cases a prototype has been built ([26],[27],[28]). Note that neither of the prototypes is provided with the optimization logic described above.

The first prototype ([26],[27]) is an 8086 microprocessor environment equipped with a language coprocessor interpreting M-code. M-code is the intermediate language for Modula-2. When compared to a commercial interpretive Modula-2 implementation, the LCP exhibits an improvement by a factor of 6. Compared to native code execution in a compiled Modula-2 environment the performance is lower only by a factor of 1.6, while the code compaction is somewhere between 5 and 4 [29]. This illustrates how well our approach implements the ideas of HLLDA's.

The second prototype [28] accelerates the execution of threaded code programs. The popular language Forth was chosen to test the LCP/CPU idea in such environments. As threaded code has a hierarchical structure which depends on the HLL program, the interpretive overhead and, as a consequence, the speedup obtained by the LCP vary. For typical Forth programs the improvement is between a factor of 1.8 and 3.1 compared to a commercial Forth implementation. For further details of these prototypes refer to references [26], [27] and [28].


6. CONCLUSION

In the previous sections we have presented our HLLDA approach consisting of a standard microprocessor and a language coprocessor. By comparing this concept with related architectures, we have illustrated the strong constraints imposed by the predefined execution unit, which strongly reduce the degrees of freedom in changing architectural parameters. Despite this limitation, our approach still exhibits several advantages over typical HLLDA designs:

(i) The execution unit is a standard microprocessor, which reduces both the development cost and the production cost of the HLLDA, and enhances flexibility;

(ii) The language oriented part consists of a coprocessor which fits in a standard microprocessor environment and leaves the capabilities of this environment untouched;

(iii) The language coprocessor dynamically tailors the instruction stream for the microprocessor, which allows a wide range of application and language oriented optimizations;

(iv) The obtained complexity of the coprocessor is low, while the interpretive overhead is reduced considerably.

We feel that these advantages make the LCP approach a viable alternative in the general HLLDA context.

7. ACKNOWLEDGEMENT

The author wishes to thank prof. dr. Van Campenhout for his thorough proofreading. This research was partially supported by the 'Instituut ter Aanmoediging van Wetenschappelijk Onderzoek in Nijverheid en Landbouw (IWONL)'.

REFERENCES

[1] Bashkow, T.R., Sasson, A., and Kronfeld, A. (1967), "System design of a FORTRAN machine," IEEE Trans. on Computers (USA), vol. 16, no. 8, pp. 485-499.
[2] Milutinovic, V., and Waldschmidt, K. (1983), "A high-level language architecture for time-critical dedicated microprocessing," Microprocessing and Microprogramming (Netherlands), vol. 12, pp. 33-41.
[3] Chu, Y., and Abrams, M. (1981), "Programming languages and direct-execution computer architectures," IEEE Computer (USA), vol. 14(7), pp. 22-31.
[4] Hoevel, L. W. (1974), "'Ideal' directly executed languages: an analytical argument for emulation," IEEE Trans. on Computers (USA), vol. 23, no. 8, pp. 759-767.
[5] Wilkes, M. V. (1982), "The processor instruction set," SIGMICRO Newsl. (USA), vol. 13(4), pp. 3-5.
[6] Hopkins, W. C. (1983), "HLLDA defies RISC: thoughts on RISCs, CISCs and HLLDAs," SIGMICRO Newsl. (USA), vol. 14(4), pp. 70-74.
[7] Silbey, A., Milutinovic, V., and Mendoza-Grado, V. (1986), "A survey of advanced microprocessors and HLL computer architectures," IEEE Computer (USA), vol. 19(8), pp. 72-85.
[8] Wirth, N. (1981), "The Personal Computer Lilith," Technical report, ETH Zürich (Switzerland).
[9] Moon, D. A. (1987), "Symbolics architecture," IEEE Computer (USA), vol. 20(1), pp. 43-52.
[10] Kaneda, Y., Tamura, N., Wada, K., Matsuda, H., Kuo, S., and Maekawa, S. (1986), "Sequential Prolog machine PEK," New Generation Computing, vol. 4, pp. 51-65.
[11] Flynn, M. J., and Hoevel, L. W. (1983), "Execution architecture: the DELTRAN experiment," IEEE Trans. on Computers (USA), vol. C-32(2), pp. 156-175.
[12] Wakefield, S.P., and Flynn, M.J. (1987), "Reducing execution parameters through correspondence in computer architecture," IBM Journal of Research and Development (USA), vol. 31, no. 4, pp. 420-434.
[13] Sammer, W., and Schwärtzel, H. (1982), "Chill, eine moderne Programmiersprache für die Systemtechnik," Springer, Berlin (FRG).
[14] Wilson, D. (1986), "Language specific ICs: Optimal execution solution," Digital Design (USA), vol. 16(1), pp. 71-76.
[15] Hoffmann, R. (1983), "A classification of interpreter systems," Microprocessing and Microprogramming (Netherlands), vol. 12, pp. 3-8.
[16] Pemberton, S., and Daniels, M.C. (1982), "Pascal implementation: the P4 compiler," Ellis Horwood, Chichester, United Kingdom.
[17] Debaere, E.H. (1988), "The extension of the coprocessor concept to the instruction path," Proc. European Simulation Multiconference, Nice, France, 1-3 June, to appear.
[18] Lampson, B. W., McDaniel, G., and Ornstein, S. M. (1984), "An instruction fetch unit for a high-performance personal computer," IEEE Trans. on Computers (USA), vol. C-33(8), pp. 712-730.
[19] Thakkar, S.S., and Hostmann, W.E. (1986), "An instruction fetch unit for a graph reduction machine," Proc. 13th annual international symposium on computer architecture, 2-5 June, Tokyo, Japan.
[20] Knödler, B., and Rosenstiel, W. (1986), "A Prolog preprocessor for Warren's abstract instruction set," Microprocessing and Microprogramming (Netherlands), vol. 18, pp. 71-80.
[21] Patterson, D. A., and Sequin, C. H. (1981), "RISC I: A Reduced instruction set VLSI computer," Proc. 8th ann. symposium on computer architecture, May 12-14, Minnesota (USA), pp. 443-450.
[22] Wirth, N. (1986), "Microprocessor architectures: a comparison based on code generation by compiler," Communications ACM (USA), vol. 29(10), pp. 978-990.
[23] Flynn, M. J., Mitchell, C. L., and Mulder, J. M. (1987), "And now a case for more complex instruction sets," IEEE Computer (USA), vol. 20(9), pp. 71-83.
[24] Blomme, R., Brokken, D., and Van Campenhout, J. M. (1987), "Driemaandelijks aktiviteitsverslag, spilprogramma robotica, robotsturing IWONL conventie 4930," Technical report, Electronics Laboratory, State University of Ghent, Belgium.
[25] Bursky, D. (1985), "Instruction generation technique speeds program execution," Electronic Design, vol. 33, no. 3, pp. 40-44.
[26] Debaere, E. H. (1986), "Language Coprocessor for Interpretive Execution of Modula-2 Programs," IEE Electronics Letters (United Kingdom), vol. 22(24), pp. 1302-1304.
[27] Van Campenhout, J. M., and Debaere, E. H. (1987), "Language coprocessor to support the interpretation of Modula-2 programs," Microprocessors and Microsystems (United Kingdom), vol. 11, no. 6, pp. 301-307.
[28] Debaere, E. H. (1987), "A Language Coprocessor for the Interpretation of Threaded Code," Microprocessing and Microprogramming (Netherlands), vol. 21, pp. 593-602.
[29] Van Campenhout, J.M. (1986), "The combination of interpretation and multiprocessing: a marriage of reason?," Proc. Parallel Computing '85, North-Holland, Amsterdam (Netherlands), pp. 389-394.