Future Generation Computer Systems 21 (2005) 679–685
Retargetable code generation for application-specific processors

A. Doroshenko, D. Ragozin
Institute of Software Systems of NASU, Academician Glushkov Prosp. 40, 13187 Kyiv, Ukraine
E-mail: [email protected] (A. Doroshenko), [email protected] (D. Ragozin)

Available online 19 November 2004
Abstract

An approach to intelligent retargetable compilation is introduced to bridge the gap between hardware and software development and to increase the performance of embedded systems. It focuses on knowledgeable treatment of code generation, where knowledge about the target microprocessor architecture and human-level heuristics are integrated into a compiler expert system. The structure of an experimental compiler supporting code generation for DSPs and VLIW-DSPs is described, and a technique for determining an optimal instruction set architecture for program execution is presented. Results of code generation experiments are given for the DSPstone benchmarks.

© 2004 Elsevier B.V. All rights reserved.

MSC: 68Q10; 68Q20; 68Q25

Keywords: Instruction level parallelism; Retargetable compiler; DSP; VLIW; Processor architecture
1. Introduction

Most modern microprocessors are application-specific instruction processors (ASIPs) [5] whose kernels are expanded with various application-oriented units. Efficient utilization of these specialized units is usually achieved by programming in assembly language, because traditional compilers cannot handle the extended instructions efficiently. Digital signal processor (DSP) architecture is a good example of an ASIP: it provides irregular instruction level parallelism (ILP). Since standard compilers are usually unable to take into account important performance opportunities of ASIPs, the solution should be sought in the
way of intelligent manipulation of knowledge about both the software to be designed and the target architecture. This paper reflects our results and experience in research on retargetable compilers for DSP and VLIW processors within our HBPK-2 project (http://dvragozin.hotbox.ru).

The retargetable compilation problem, in the sense of quality code generation for different microprocessors, arose 15-20 years ago [5]. The first general-purpose retargetable compiler, RECORD, was built by R. Leupers; similar systems, e.g. Flexware and MIMOLA, were built by other researchers [5]. RISC microprocessors usually have orthogonal register file(s) and ISA, so optimal instruction scheduling for them is a straightforward combinatorial problem. If a microprocessor has a complex structure, such as a RISC kernel plus a DSP coprocessor, a traditional compiler generally utilises only the RISC kernel, not the DSP extension [1]. The compiler is unable
to extract knowledge about the utilization of application-oriented units from a processor description alone. Hence, current efforts in retargetable compilation concentrate on improving code generation methods for wide processor families.

In this paper, an approach to retargetable compilation is proposed for enhancing the performance of compiled high-level programs. Knowledge-oriented techniques that can improve code generation quality for irregular architectures are presented. In Section 2, a simple motivating example of increasing instruction parallelism in DSP code generation is presented. In Section 3, the structure of our retargetable compiler is considered. In Section 4, code generation techniques are described. In Section 5, the integration of a knowledge base into the compiler is considered. In Section 6, a technique for deciding on an optimal processor architecture is described and numerical results of code generation improvement are presented.
2. Simple motivating example

To illustrate the possibilities for enhancing code quality on a DSP, consider a simple example: distributing the variables of a digital signal filter (convolution) over a Harvard memory architecture:

    s = 0;
    for (i = 0; i < N; i++)
        s += a[i] * b[i];
The loop body maps to four machine instructions:

    (1) R1 = mem(Ia++);
    (2) R2 = mem(Ib++);
    (3) R3 = R1 * R2;
    (4) RS = RS + R3.

A Harvard architecture with two memory spaces executes (1) and (2) in parallel. Consider the software-pipelined loop (the superscript denotes the loop iteration number, i = 3, ..., N):

    (1)^1 (2)^1                      (loop prolog)
    (1)^2 (2)^2 (3)^1                (loop prolog)
    (1)^i (2)^i (3)^{i-1} (4)^{i-2}  (loop body)
    (3)^N (4)^{N-1}                  (loop epilog)
    (4)^N                            (loop epilog)

The loop body takes one cycle to execute. But if both arrays are located in one memory space, the instruction schedule of the loop body becomes:

    (1)^i (3)^{i-1} (4)^{i-2}        (loop body)
    (2)^i                            (loop body)

which takes twice the time of the original loop. Instructions (1) and (2) cannot be scheduled as one processor command, because by default the arrays are placed in one memory space. Accurate information about variable placement can be obtained only after collecting information about memory reference conflicts.
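For concreteness, here is a minimal C-level sketch of the pipelined schedule, with the registers modelled as scalar variables (an illustration only, not compiler output, assuming float data and N >= 2); when a[] and b[] sit in different memory spaces, the three statements of the kernel body map to one VLIW word per cycle:

    /* Software-pipelined convolution kernel in C form; r1, r2 model
       the load registers, r3 the product register, s the sum. */
    float convolution(const float *a, const float *b, int n) {
        float r1 = a[0], r2 = b[0];      /* prolog: (1)^1 (2)^1        */
        float r3 = r1 * r2;              /* prolog: (3)^1              */
        r1 = a[1]; r2 = b[1];            /*         (1)^2 (2)^2        */
        float s = 0.0f;
        for (int i = 2; i < n; i++) {    /* kernel: one cycle per pass */
            s += r3;                     /* (4) of iteration i-2       */
            r3 = r1 * r2;                /* (3) of iteration i-1       */
            r1 = a[i]; r2 = b[i];        /* (1)(2) of iteration i      */
        }
        s += r3;                         /* epilog: (4)^{N-1}          */
        r3 = r1 * r2;                    /* epilog: (3)^N              */
        return s + r3;                   /* epilog: (4)^N              */
    }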
3. A model and structure of the retargetable compiler

Fig. 1 shows the structure of the retargetable compiler prototype HBPK-2 [6]. The compiler consists of four major modules, the Lexical and Syntax Analyser (LSA), the Global Optimiser (GO), the Code Generator (CG) and the Code Analyser (CA), together with the external XML-to-C compiler RK3.
Fig. 1. Structure of retargetable compiler prototype HBPK-2.
Fig. 2. Example of a sample program and the formed HG.
Description files, in which all directives and tunings for the compiler parts are collected, are in XML format. The lexical and syntax analyser module forms a hierarchical graph (HG) of data and control flow derived from the source program code; the analyser can support any programming language. An example of the HG of a sample program together with its source code is presented in Fig. 2.

The program hierarchical graph H = (T, G), where T is a tree of hierarchy and G is a sequence of acyclic oriented graphs, consists of vertices of two types: recognisers and transformers. Each transformer represents a basic block. Each recogniser controls the program flow and may have a son on the current hierarchy level and its "body" at the lower hierarchy level. All loops, jumps and conditional operators are represented as recognisers.

At the HG level, global optimisations are expressed as a graph grammar: a set of rules (graph productions) that are applied iteratively to the HG. A graph production is a tuple (L, R, E, C), where L and R are two graphs, the left and right parts of the production, E is the embedding (transformation) mechanism, and C is a condition of production applicability. A production p is applied to the hierarchical graph G in the following way: (1) the optimiser finds occurrences of L in G (where C holds); (2) the part of G corresponding to L is deleted and a context graph D is obtained; (3) R is built into D by the mechanism E, yielding the final graph H. The graph grammar in HBPK-2 successfully extracts index variables, looping variables (needed to express the circular buffers widely used in DSP code), triangle variables and index transformation expressions, and finds aggregate variables (reductions).
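As an illustration, the following C declarations sketch one possible in-memory shape of the hierarchical graph H = (T, G) and of a graph production (L, R, E, C); the type and field names are ours, not those of the HBPK-2 sources:

    typedef enum { TRANSFORMER, RECOGNISER } vertex_kind;

    struct hg_graph;                      /* one acyclic graph of G        */

    typedef struct hg_vertex {
        vertex_kind       kind;
        struct hg_vertex *son;            /* recogniser: son on this level */
        struct hg_graph  *body;           /* recogniser: body a level down */
        void             *basic_block;    /* transformer: basic block data */
    } hg_vertex;

    typedef struct hg_graph {
        hg_vertex **vertices;             /* edges kept per vertex         */
        int         nvertices;
    } hg_graph;

    typedef struct {
        hg_graph *L, *R;                        /* left/right parts        */
        void (*E)(hg_graph *D, hg_graph *R);    /* embedding mechanism     */
        int  (*C)(const hg_graph *match);       /* applicability condition */
    } graph_production;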
4. Code generation

In HBPK-2, code generation is divided into two parts: an instruction selection step, and a combined instruction scheduling and register allocation step. After code generation, a code analysis step is performed by the code analysis module. A classic DSP is treated as a VLIW with a variable-length instruction word. Other "irregular" DSP features, such as non-orthogonal register files, several memory spaces and different computation modes, can be handled using enhanced code generation techniques. We avoid solutions like the DSP-C [4] extension to the C language because of its non-portability. The HBPK-2 compiler uses common code generation methods: a combined instruction scheduling and forward register allocation step, which avoids extra register spill code when register pressure is high. The potential of HBPK-2 is much wider and can cover hardware/software codesign.

The code generator deals with the target architecture and provides the possible machine-dependent optimisations. For efficient code generation it needs information about the principles of optimisation for the target processor. No traditional compiler (such as the freeware GCC) can be ported to a DSP architecture and produce efficient low-level DSP code, because RISC and DSP processors have different programming paradigms. DSP architecture is strongly oriented to speeding up only certain algorithms, like convolution, Fourier/Hartley transforms and matrix multiplication. That is why we have to encode knowledge on "how code must be generated to utilise the processor 100%". We also use another technique, based on iterative program graph analysis, which can improve code generation results for software pipelining, data clustering, and distributing data into memory spaces for the Harvard architecture.
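As a rough sketch of the combined step (our reconstruction under stated assumptions, not the actual HBPK-2 algorithm), a list scheduler can bind physical registers only at the moment an instruction is placed, so spill code is emitted against the real schedule rather than against a pre-allocation estimate; all types and helper names below are illustrative:

    typedef struct instr instr;

    typedef struct {             /* scheduler context                     */
        instr   *ready[256];     /* data-ready, unscheduled instructions  */
        int      nready;
        unsigned free_regs;      /* bitmask of free physical registers    */
    } sched_ctx;

    extern instr *pick_best(sched_ctx *c);          /* priority heuristic */
    extern int    need_reg(const instr *in);
    extern int    take_reg(sched_ctx *c, instr *in);     /* -1 if none    */
    extern void   emit(instr *in);                  /* place in schedule  */
    extern void   emit_spill(sched_ctx *c);         /* free one register  */
    extern void   release_dead(sched_ctx *c, instr *in);
    extern void   wake_successors(sched_ctx *c, instr *in);

    void schedule_block(sched_ctx *c) {
        while (c->nready > 0) {
            instr *in = pick_best(c);
            if (need_reg(in) && take_reg(c, in) < 0) {
                emit_spill(c);       /* pressure too high: spill instead  */
                continue;            /* retry once a register is free     */
            }
            emit(in);
            release_dead(c, in);     /* last uses free their registers    */
            wake_successors(c, in);  /* newly data-ready instructions     */
        }
    }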
5. Compiler expert system productions

There is a strong need for a unified representation of the compilation knowledge base, and for defining the behaviour of the logical deductive system that supplies advice on particular code generation problems from that base. We use a production expert system whose rules have the following general
form:

    NAME <production_name> [RETURN <default_value>]
    [IF <predicate> OF <cond_1>[, <cond_2>, ..., <cond_N>] DO <action>]
    [... other IF-DO rules]

where production_name is a unique expert production name; default_value is the value the production returns by default; predicate is ONE, ALL, MOST, a number of true conditions, or a production name, and defines how many of the following conditions must be true for the production action to proceed; cond_x are conditions (represented as functions) whose results are boolean; action is the action performed if the predicated conditions hold.

The set of productions must be complete enough to specify all processor features, but the "expert system" as a whole is not monolithic: each production serves a particular code generation procedure. The production set is a good formalism for representing this knowledge because, compared with other approaches (such as neural networks and frames), it is simple and fast. Expert system productions describe processor features in a unified form and control the code generation process. The basic expert production types [6,2] are the following.

(1) Expert variables: a set of variables defining a long list of basic features of the described processor, of the form NAME <name> RETURN <value>;.

(2) Optimising processor-dependent transformations: productions used for architectures with instructions that accelerate complex operations such as |a + b|, |a - b|, (a + b)/2, of the form NAME <name> IF <conditions> DO <transformation>.

(3) Templates: ASIPs often have partial instructions, such as partial division and partial square root. In the generated code the full operation must be replaced by a procedure that computes the function via the partial operations; for example, square root on the ADSP-21060:

    Y = partroot(X);
    for (i = 1; i <= 3; i++)
        Y = 0.5*Y*(3 - X*(Y*Y));
(4) Tables of advices (productions): for common code generation cases the compiler uses tables of cases, for example for data structure access. Consider x[i].p->t->k: it consists of several parts: a base array reference (x), an array index ([i]), a field reference (.p) and field references by pointer (->t, ->k). For such types a table of code generation procedures is formed. Other tables are built for data access, function prolog/epilog generation, stack frame forming and array access.

(5) Pragmas (compiler directives): #pragma directives are used for interaction between the programmer and the compiler. Internally a pragma behaves like an expert variable that can be changed by the programmer.

Expert productions make it possible to apply microprocessor-specific optimisations and templates. Complex instructions (like combined addition and multiplication) are fully supported. All instructions are described in a unified graph format and are processed by the compiler uniformly. Register allocation and instruction scheduling are joined into a single phase, which is the only practical way to provide retargetable code generation for an "unknown" processor using expert system consultations. Special attention is paid to the utilisation of extended register files, such as DSP accumulator registers: they are 48 to 80 bits wide, but are usually used only for handling reductions. The compiler must extract additional information from program statements to use these register banks effectively, for example for storing reduction variables. Also, if a processor has clustered register files (e.g. four clusters in the ADSP-21k, two clusters in the TMS320C60), the expert system must consult the compiler on how the register file chunks can be used, providing coefficients for cluster allocation during scheduling. One possible C encoding of such productions is sketched below.
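The following C fragment shows one hypothetical encoding of an expert production and of the ONE/ALL/MOST predicate evaluation, mirroring the NAME / RETURN / IF-OF-DO shape given above; none of these names come from the HBPK-2 sources:

    typedef enum { PRED_ONE, PRED_ALL, PRED_MOST, PRED_COUNT } predicate;

    typedef int  (*cond_fn)(const void *ctx);   /* boolean condition     */
    typedef long (*action_fn)(void *ctx);       /* performed action      */

    typedef struct {
        predicate pred;       /* how many conditions must hold           */
        int       count;      /* used when pred == PRED_COUNT            */
        cond_fn   conds[8];
        int       nconds;
        action_fn action;
    } if_do_rule;

    typedef struct {
        const char *name;           /* unique production name            */
        long        default_value;  /* RETURN <default_value>            */
        if_do_rule  rules[4];
        int         nrules;
    } expert_production;

    static long fire(const expert_production *p, void *ctx) {
        for (int r = 0; r < p->nrules; r++) {
            const if_do_rule *rule = &p->rules[r];
            int hit = 0;
            for (int c = 0; c < rule->nconds; c++)
                hit += rule->conds[c](ctx);
            int need = rule->pred == PRED_ALL  ? rule->nconds
                     : rule->pred == PRED_ONE  ? 1
                     : rule->pred == PRED_MOST ? rule->nconds / 2 + 1
                     :                           rule->count;
            if (hit >= need)
                return rule->action(ctx);    /* DO <action>              */
        }
        return p->default_value;             /* no rule fired            */
    }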
The expert system also holds knowledge about applicable code generation methods for prescribed cases (instructions usable only with certain addressing modes, joined computational and control transfer instructions, predicated instructions, delayed control transfers). For example, the DSP ADSP-21k has only one command in which a dual reference to both memory spaces can be made:

    Rx = DM(Ix, Mx), Ry = PM(Iy, My), <compute>;

The command is suitable for most DSP kernel procedures, but it supports only one special addressing mode, with post-increment. So, to utilise dual memory access, the compiler must have the addresses placed in particular registers in order to schedule the instructions properly; the expert system can consult the compiler in such cases.

Sometimes the expert system information is insufficient for an optimisation, e.g. exploiting the Harvard memory architecture. Previous research [7] gives only very rough methods for exploiting such possibilities, especially for DSPs. Exact information about variable distribution can be obtained only from the conflicts that occur while loading values from memory into registers, and this information becomes available only after the instruction scheduling pass. Hence this optimisation requires at least two cycles of code generation: the first to gather information about the conflicts, the second to generate an improved schedule. Below, a code analysis method is presented that combines an iterative code generation process with analysis of the quality of the generated code.
6. Code and processor architecture analysis

To solve the variable distribution problem, the following scheme is proposed. For the set of memory locations M = {m_i}, which consists of all program variables and memory-referenced constants, we define the set MC ⊆ M × M, MC = {c_{i,j} | i < j}, c_{i,j} ∈ N, where c_{i,j} is the cost of the conflict between loading variables m_i and m_j when they are located in one memory space. Initially MC holds zeroes. Each time a conflict between m_i and m_j occurs, the value 10^D is added to c_{i,j}, where D is the loop nest depth of the conflicting instructions (outside of loops D = 0, so 10^D = 1). After collection is complete, MC is sorted in descending order. For the biggest values c_{i,j}, the variables m_i and m_j are placed into different memory spaces; after a conflict is resolved, c_{i,j} is deleted (set to 0) and the matrix columns are merged.
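The following C sketch shows one way to implement the collection and distribution steps just described; the function and array names are ours, and the column-merging step is left out for brevity:

    #include <math.h>

    #define NVARS 64                    /* |M|, illustrative bound       */
    static double mc[NVARS][NVARS];     /* c_{i,j}, i < j; zeroed        */

    /* called by the scheduler when loads of m_i and m_j could not be
       issued in one cycle because both lie in the same memory space    */
    void record_conflict(int i, int j, int depth) {
        if (i > j) { int t = i; i = j; j = t; }
        mc[i][j] += pow(10.0, depth);   /* weight 10^D, 1 outside loops  */
    }

    /* after scheduling: split the costliest pairs across the two
       memory spaces, resolving (zeroing) each conflict in turn         */
    void distribute(int space[NVARS]) {
        for (;;) {
            int bi = -1, bj = -1;
            double best = 0.0;
            for (int i = 0; i < NVARS; i++)
                for (int j = i + 1; j < NVARS; j++)
                    if (mc[i][j] > best) { best = mc[i][j]; bi = i; bj = j; }
            if (bi < 0) break;          /* all conflicts resolved        */
            space[bj] = !space[bi];     /* separate m_i and m_j          */
            mc[bi][bj] = 0.0;
        }
    }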
Other similarly important tasks are variable clustering and optimal software pipelining. Variable clustering is used for microprocessors (usually DSPs) that have limited possibilities for indexed variable addressing (especially in the stack frame): effective addressing is usually possible only for small offsets, up to 8 or 16 words. Thus, it is useful to place jointly used variables in adjacent memory cells.

However, the most important problem is optimal software pipelining. One of the best software pipelining algorithms, Enhanced Pipeline Scheduling (EPS) [3], requires loop unrolling before processing loops, but the accurate unrolling coefficient is unknown in advance. An iterative procedure is necessary here, combined with expert system consultations about the set of unrolling coefficients to try, in order to reduce code generation time.

EPS properties are also useful for processor architecture analysis. Consider a program fragment (a loop body) from the MPEG-2 decoder that accounts for about 2/3 of the total decoding time:

    for (j = 0; j < h; j++) {
        for (i = 0; i < 16; i++) {
            v = ((p1[i] + p1[i+1] + 1) >> 1) - p2[i];
            if (v >= 0) s += v; else s -= v;   // s += abs(v);
        }
        p1 += lx; p2 += lx;
    }

On the widely used DSP ADSP-21k the inner loop body can be executed in 6 cycles:

            R1 = DM(I1,1);                          // (1)^1
            LCNTR = N-2, DO L0002 UNTIL LCE;
            R1 = R3+R1, R3 = R1;                    // (2)^i (3)^i
            R1 = R1+1;                              // (4)^i
            R1 = LSHIFT R1 BY -1, R2 = DM(I2,1);    // (5)^i (6)^i
            R1 = R1-R2;                             // (7)^i
            R1 = ABS R1;                            // (8)^i
    L0002:  R4 = R4+R1, R1 = DM(I1,1);              // (9)^i (1)^{i+1}

A traditional DSP kernel cannot provide enough instruction level parallelism here. Since the EPS method can work without taking the availability of microprocessor resources into account, it finds the maximum available parallelism of the loop body:

    (9)^{i-6} (8)^{i-5} (7)^{i-4} (5)^{i-3} (6)^{i-3} (4)^{i-2} (2)^{i-1} (3)^{i-1} (1)^i   (loop body, i = 7, ..., N)

Analysis shows that to execute the pipelined loop body in one cycle the processor must have: (1) 12 general purpose registers; (2) a shifter, 3 adders, an incrementor and an "absolute value" functional unit; (3) a 9-slot-wide
Table 1
Performance improvement using the prototype compiler

                           ADSP-21060                    L1879VM1 (NM6403)
                    Permanent  HBPK-2   Improve-   Permanent  HBPK-2   Improve-
    Test program    (cycles)   (cycles) ment (%)   (cycles)   (cycles) ment (%)
    Real update            5         5      0            31        28      9.7
    N real updates      1006       406     59.6        8068      2605     67.7
    Complex update        18        10     44.4         132        84     36.3
    Complex updates     2110       811     61.5       21864      8316     61.9
    Dot product           31         8     74.1         175        51     70.8
    FIR                   91        20     78.0         968       276     71.4
    Convolution          126        19     84.9         760        71     90.6
    Matrix             11618      1844     84.1      102964      6253     93.9
    Matrix 1x3           137        37     72.9         912       247     72.9
    FIR2DIM            12147      2906     76.0       19194      2919     84.7
    IIR one biquad        23        12     47.8         181       118     34.8
    IIR n biquad         211        80     62.0        2669       842     68.4
    LMS                  111        35     68.4        1832       213     88.3
    FFT                 2271      1178     48.1       55555      7828     85.9
instruction word. Thus, using the EPS method we can "measure" the processor resource requirements and make decisions about the hardware architecture. Such analysis can be performed during compilation for the most deeply nested (or user-selected) loop bodies, with the details of the analysis guided by the expert system.
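As a toy, self-contained illustration of this measurement step (our own reconstruction; the op table simply encodes the nine operations of the pipelined kernel above), one can count the functional units and issue slots a one-cycle kernel would need:

    #include <stdio.h>

    typedef enum { OP_LOAD, OP_ADD, OP_INC, OP_SHIFT, OP_ABS } op_kind;

    int main(void) {
        /* operations (9)(8)(7)(5)(6)(4)(2)(1) of the one-cycle kernel;
           (7) is a subtract executed on an adder, (3) is a reg move   */
        op_kind kernel[] = { OP_ADD, OP_ABS, OP_ADD, OP_SHIFT,
                             OP_LOAD, OP_INC, OP_ADD, OP_LOAD };
        int nops = (int)(sizeof kernel / sizeof kernel[0]);
        int counts[5] = {0};
        int slots = nops + 1;            /* + the move (3)             */
        for (int i = 0; i < nops; i++)
            counts[kernel[i]]++;
        printf("loads=%d adders=%d incr=%d shifters=%d abs=%d slots=%d\n",
               counts[OP_LOAD], counts[OP_ADD], counts[OP_INC],
               counts[OP_SHIFT], counts[OP_ABS], slots);
        return 0;                        /* prints 2, 3, 1, 1, 1, 9     */
    }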
Some of the described techniques were tested with the retargetable compiler prototype HBPK-2. Table 1 gives code generation results on the standardised DSPstone benchmarks [8]: the left half of the table shows results for the ADSP-21060, the right half for the neuroprocessor L1879VM1 (NM6403). The largest speedup is achieved on tasks such as filtering, convolution, FFT and matrix multiplication, which exploit the highest instruction level parallelism provided by the DSP architecture. This optimisation level cannot be achieved with the usual "classic" optimisations; the speedup presented in the table (50-300%) is achieved owing to the expert system.

7. Conclusion

An approach to intelligent retargetable compilation that yields improvements in code generation for DSP and VLIW processors has been developed and demonstrated. The results are due to: (1) improved utilisation of microprocessor units; (2) iterative optimisation techniques such as variable distribution over memory spaces; (3) analysis of the target processor architecture.
The proposed methods extend compiler retargetability over a larger range of microprocessors and can be applied to irregular ASIP architectures. The research continues in new directions, including the adaptation of compilers to industrial highly specialised microprocessors (VLIWs, neuroprocessors) based on machine learning techniques.
References

[1] S. Bashford, Code Generation Techniques for Irregular Architectures, Tech. Rep. 596, Universität Dortmund, 1995.
[2] A.Yu. Doroshenko, D.V. Kuivashev (Ragozin), Intelligent compacting compilers for VLIW microprocessors, Prob. Progr. 1-2 (2001) 138-151 (in Ukrainian).
[3] K. Ebcioglu, T. Nakatani, A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture, in: Languages and Compilers for Parallel Computing, MIT Press, Cambridge, MA, 1990, pp. 213-229.
[4] K. Leary, W. Waddington, DSP-C: a standard high level language for DSP and numeric processing, in: Proceedings of ICASSP-90, ACM Press, New York, April 1990, pp. 1065-1068.
[5] P. Marwedel, G. Goossens, Code Generation for Embedded Processors, Kluwer Academic Publishers, Dordrecht, Boston, London, Lancaster, 1995.
[6] D.V. Ragozin, Retargetable code generation methods for irregular long instruction word microprocessor architectures, PhD thesis, Institute of Software Systems, National Academy of Sciences of Ukraine, Kiev, 2002 (in Ukrainian).
[7] M. Saghir, P. Chow, C. Lee, Exploiting dual data-memory banks in digital signal processors, in: Proceedings of the 8th International Conference on Architectural Support for Programming
Languages and Operating Systems, ACM Press, New York, 1996, pp. 234-243.
[8] V. Zivojnovic, H. Schraut, M. Willems, R. Schoenen, DSPs, GPPs, and multimedia applications: an evaluation using DSPstone, in: Proceedings of the International Conference on Signal Processing Applications and Technology, DSP Associates, Boston, MA, 1995, pp. 1779-1783.
Anatoliy Yu. Doroshenko received his Master degree in Computer Science in 1973 from the National T. Shevchenko University of Kyiv (NUK), Kyiv, Ukraine. He received his PhD and higher doctorate degrees in 1989 and 1997, respectively, both from the Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine (NASU). His current position is research director at the Institute of Software Systems of NASU and visiting professor of Computer Science at NUK and Kyiv-Mohyla University. His professional activity includes research and development of formal modeling techniques and parallel programming methods, and teaching parallel computer systems in higher education. He is an author of more than 70 technical papers in journals and international conference proceedings, including two monographs. Currently he is engaged in a project on high performance computation in metacomputing architectures. His current interests concentrate on models of parallel computation, parallel programming methodologies and coordination issues in distributed software systems.
Dmitry V. Ragozin received his Master degree from Chernigov Technological Institute (a former branch of Kiev Polytechnical Institute), and his PhD degree from the Institute of Software Systems of the NAS of Ukraine. In 2002, he became a scientific researcher at the Institute of Software Systems of NASU. Currently he is a senior scientific researcher in Intel Nizhny Novgorod Labs and an associate professor at Nizhny Novgorod State University. His research interests include compilers, automatic parallelization software, machine learning, modeling, and system and real-time programming.