Journal of Systems Architecture 45 (1999) 1139–1149
www.elsevier.com/locate/sysarc
Microprocessor design for embedded system
Alessandro De Gloria 1
University of Genoa, DIBE, Via Opera Pia 11A, 16145 Genova, Italy
Abstract
This paper discusses the two main approaches to the design of microarchitectures for embedded processors: VLIW and Superscalar. The latter is preferred for the particular needs of an embedded system. A methodology and a design flow for Superscalar microprocessor design are then presented. The approach relies on exploiting the features of the application to which the processor is dedicated. Particular attention is also given to fast time-to-market and to re-usability, which are key factors in embedded microprocessor design. © 1999 Published by Elsevier Science B.V. All rights reserved.
Keywords: Microprocessor; VLIW; Superscalar; Design methodology; Microprocessor evaluation
1. Introduction
The growth of VLSI technology is likely to continue for the next 20 years along the trend described by Moore's law. This makes it possible to build complete, complex systems on a single chip. Along with the growing number of transistors per chip, we expect other factors, such as clock frequency and power consumption, to improve further. Taken together, this means that chips will become more complex, more powerful in terms of computing power, and less power hungry. This picture leads us to foresee a wide spread of electronics in everyday life. One of the key factors encouraging the wide diffusion of electronic devices will be the improvement of the man-machine interface, where the great challenge is to allow non-specialists to use complex electronic systems. The main fields affected by this spread will be automotive and consumer electronics. Car
1 E-mail: [email protected]
1383-7621/99/$ – see front matter © 1999 Published by Elsevier Science B.V. All rights reserved. PII: S1383-7621(98)00054-X
A.D. Gloria / Journal of Systems Architecture 45 (1999) 1139–1149
driving aids, web navigators, hand-held devices and home automation devices (to cite a few) will be very common. Along with the increasing technological capability, we shall witness a market explosion in which competition will be very hard, and factors such as low cost and short time-to-market will be very important for the success of a product. In this scenario, dedicated systems, which are specifically designed to perform predetermined tasks, will be particularly attractive. They save silicon area, since their hardware is specifically designed, and they perform very well, since the core functions, which are known at design time, can be accelerated both in hardware and in software. The main problem with dedicated systems is the long design time they require, which makes it difficult to meet a short time-to-market. This is due to the design of dedicated hardware units and to the development of software that, to some extent, cannot be ported from other applications, given the dedicated nature of the system on which it has to run. A possible solution to this problem is a library of complex functions that can be used in different projects by simply re-targeting some functionality. The units can also be equipped with software that acts as a driver for the unit. This is the case of embedded systems, which are dedicated to the application to some extent: they are composed only of the units needed to perform a certain task, but the units are not specifically designed; they already exist as models in a library. With this approach it is possible to perform fast virtual prototyping, essential for tuning the system, and then rapid prototyping, needed for testing the system in the field. In all these applications, the microprocessor will be the key to meeting most of the requirements dictated by the system.
The microprocessor is an enabling technology that allows managing the behavior of a system, in particular the man-machine interface and the algorithmic functions that cannot be implemented directly in hardware. In addition, microprocessors are subject to the same constraints as the systems. In particular, they have to be dedicated in order to save silicon and enhance performance, they need a short design time, and they must allow software and hardware re-use and permit virtual verification inside the system they are embedded in.
2. VLIW vs. Superscalar
Microprocessor technology currently offers two main architectural solutions for the design of high-performance microprocessors: VLIW machines [3] and Superscalar machines [2]. Both have pros and cons for use in an embedded system. VLIW machines represent the best cost/performance tradeoff, since they avoid any complex control part and use most of the hardware to execute instructions. A VLIW machine (Fig. 1) is characterized by the following features:
· It has a Harvard organization.
· It is composed of several functional units whose execution time is known at compile time.
· The instruction is composed of fields; each field is associated with a functional unit, to which it sends the command specifying the function to perform.
Fig. 1. A simplified block diagram of a VLIW machine.
· At each machine cycle a new instruction is issued.
· The machine micro-architecture is exposed to the compiler.
· The parallelism intrinsic in the application has to be detected by the compiler.
· The parallelism is found by running training programs that are representative of the applications which will run on the machine [1].
Typical drawbacks of the VLIW organization are:
· A VLIW machine is very sensitive to the training programs, and an imprecise choice of them can lead to a poorly performing machine [4].
· The instruction format is determined by the functional units of the machine. A machine with fewer or more functional units has a different instruction set; therefore software developed at the binary level for one machine cannot be re-used on another, more powerful machine (i.e. one with more functional units).
· The instruction of a VLIW machine is very wide and requires a non-conventional memory word width. Although this may not be a problem in VLSI design, the worse consequence is that most of the memory space is wasted.
From the above description it follows that VLIW machines cannot re-use third-party software, since it is usually distributed at the binary level. A VLIW machine is also very sensitive to the application, and much time has to be dedicated to tuning it by running training applications. The VLIW machine can exploit scientific-like applications that seldom use pointer-chain structures, since with static structures it is possible to determine which memory accesses are independent of each other. If pointer-chain structures are used, the addresses of the memory references are known only at run time. Such structures are therefore a barrier to the exploitation of parallelism, since memory access instructions cannot be re-ordered.
Fig. 2. A simplified block diagram of a Superscalar processor.
Now let us consider the features of a Superscalar machine (Fig. 2). In the following we refer to a Superscalar machine in which the instructions of the ISA are internally decomposed into micro-operations that directly correspond to hardware functions. We can summarize the Superscalar features as follows:
· It preferably has a Harvard organization, but this is not necessary.
· Its data-path structure is similar to that of a VLIW machine.
· It discovers the parallelism at run time by using a very complex hardware structure.
· It has some sort of interpretation level between the data path and the instruction. This level is responsible for translating machine instructions into commands to send to the functional units.
· The interpretation level allows the structure of the data path to change while the instruction set remains the same. In this way, the same binary code can run on machines with different data paths.
· In principle, compilers do not intervene in discovering the parallelism. In practice, this is not true because of current hardware limitations. In order to exploit the parallelism as much as
possible, the compiler must re-arrange the instructions so that the machine can fetch instructions that are independent and can be executed in parallel.
· The hardware limitation that constrains the discovery of parallelism is due to the length of the re-ordering window and to the number of instructions that can be fetched in a cycle. Both limits are technological and could in principle be overcome in the future.
· Due to its complexity, a Superscalar machine is very hard to test.
From the characteristics of VLIW and Superscalar machines, along with the constraints on the design of embedded systems, we can see which of the two is better suited to this kind of system. The design of an embedded system generally faces two main kinds of problems: low cost and short time-to-market and, for the design of the microprocessor in particular, fitting the characteristics of the application software. From these requirements, we deduce that the Superscalar is the most suitable architectural organization. This belief rests on the following observations:
· A Superscalar microprocessor has built into hardware most of the features that a VLIW delegates to software. Therefore, it has the potential to be used in different applications, since the scheduling hardware can exploit the application parallelism that a VLIW processor has to find with test applications. This saves the user design time.
· A Superscalar microprocessor may be re-used in different applications with no or only minor re-design. This again follows from the previous item.
· A Superscalar processor may be scaled to higher or lower computing power, depending on the needs of the application, without any change in the application. The change can be accomplished at the data-path level without affecting the ISA level.
· Binary software can run on different versions of a Superscalar processor without any change. This is again due to the interpretation level.
· For the reasons stated above, a Superscalar processor may be provided to customers as an IP cell, pre-compiled and synthesized. This is because a Superscalar processor, thanks to its dynamic scheduling mechanism, is more flexible and adaptable to a given application.
· A Superscalar processor can thus speed up the design, leading to a short time-to-market.
· A Superscalar processor is more complex than a VLIW machine. The interpretation level has to be accurately designed and test vectors are not easy to develop, but this work is done only once and does not have to be repeated for each design.
· The complexity of a Superscalar leads to a larger area than a VLIW machine. This can be the main drawback. However, with the likely improvements in VLSI technology, the weight of the hardware overhead imposed by a Superscalar in the design cost will decrease dramatically. The design time and the need to put the product on the market as soon as possible will be more important.
· For certain designs where area may still be a main factor, one can envisage a first silicon product with a Superscalar processor to meet the time-to-market constraints. Then, once the product is on the market, a new version with a VLIW processor can be released.
With the above considerations, we can define a design methodology for a microprocessor in an embedded system that is based on a Superscalar architecture that can eventually be refined for the specific application. By refinement we mean the process of defining the data path and the instruction coding of a Superscalar processor without intervening on its scheduling
mechanisms, which are therefore left the same for any application. To this end, a set of tools is required for quickly defining the mentioned part of the architecture and predicting its behavior with the given application. For this purpose, we have developed a high-level design environment that allows us to perform this task.
3. A Superscalar microprocessor cell for embedded systems
We propose to use, for embedded system applications, a Superscalar architecture whose schematic is shown in Fig. 3. We single out four main blocks in the architecture:
· instruction fetch and translation unit: it fetches the instruction from memory and translates it into a format manageable by the scheduling unit. In doing the translation, the unit can decompose the instruction into simpler operations that directly correspond to functions performed by the data path;
· branch prediction unit: it predicts the direction taken by a branch by performing a statistical analysis on the previously executed branches. A table is maintained, and when a new branch has to be inserted and the table is full, the least
recently used branch is removed. The number of nested branches allowed is programmable at synthesis time;
· scheduling unit: it receives basic operations from the fetch unit and loads them into the operation window, performing register renaming. At each cycle, the unit sends to the data path the operations to execute on the basis of operand availability. The unit builds a virtual dependence graph in order to issue the operations that do not depend on any operation that has not yet been executed;
· data path: it is composed of the functional units and the register file. The functional units are connected with the register file or with each other by bypasses. The register file is organized as a matrix in which the number of columns is equal to the number of physical registers of the architecture, while the number of rows corresponds to the number of nested branches allowed. When a new branch is taken, a new row is allocated to the branch and the register values contained in the previous row are copied into the new one. If a predicted branch is found to be mispredicted when it is executed, the rows from the one associated with that branch to the last allocated one are reset. On the other hand, if the branch has been correctly predicted, the row associated with that branch becomes valid and, once all the instructions associated with the previous rows have been executed, the rows associated with them are freed. Two pointers are associated with the matrix: one points to the first busy row, the other to the last busy row. The rows are managed as a circular buffer.
Fig. 3. A simplified block diagram of the micro-architecture of the instruction translation and scheduling unit. The blocks shown do not necessarily correspond to hardware units; they identify specific functionalities whose hardware design is not the subject of the present figure.
The target of the design is the definition of the Superscalar micro-architecture, and in particular of the data path and the fetch and translation units. We consider the operations as fixed and representing basic hardware functions that do not need any sequence to be executed. All the sequences are managed by the scheduling unit on the basis of the translation performed by the translation unit. More precisely, the goal of the design is to define the instructions of the architecture, where each instruction is thought of as a combination of operations, and the number and type of functional units that constitute the data path. The methodology consists in running several benchmarks representing the application domain, collecting statistics about the type of instructions executed, defining a set of resources for the data path, and coding into a single instruction the basic operations that the statistics reveal as frequently grouped. We consider the micro-architecture of the processor fixed and parameterized, in the sense that, while the functions of the blocks composing the micro-architecture are defined, the dimensions of the blocks are to be defined so as to best match the application domain. The dimensions determine the computing power of the processor in terms of both the level of speculation admitted (i.e. parallelism) and the type of operations performed. For instance, we consider as parameters the number of pending branches, the size of the operation window and the number of real registers. The processor is functionally described in C++ and has a VHDL counterpart, where each block is described in a VHDL subset ready for synthesis. The parameters that define the functions are described in a file that is used both for the C simulation and for the VHDL simulation and synthesis. In this way, since the C and VHDL descriptions of the processor model match each other, we are sure that the instances of the model used in a particular design also match. This is very important, since consistency between two descriptions written in different languages is always a matter of concern, given the possibility of introducing disparities between them.
The starting point of the design is the definition of the specifications of the processor which, in simple terms, are the answers to these three main questions:
· What does the processor do? i.e. the algorithms that must run on the processor or, in the absence of a determined set of them, a set of benchmark programs that best characterize the application that will run on the processor.
· How much does it cost? i.e. the silicon area occupied by the VLSI implementation of the processor.
· How fast does it have to be? i.e. the time taken to perform the given tasks.
Once the specifications have been determined, the design can start. It is an optimization process that aims at matching the specification, intended as behavior and performance, with the constraints, such as the cost of the whole design and fabrication.
3.1. The design flow
The design flow is composed of two parts: (1) an analysis phase that extracts the characteristics of the application; (2) a design phase, where the micro-architecture of the processor is adapted to those characteristics. Fig. 4 shows the design flow, which is composed of the following steps:
· The benchmark programs that best represent the characteristics of the applications are compiled using the gcc compiler.
Fig. 4. The design flow of the application-dedicated Superscalar processor.
· Each benchmark is run and the run-time information collected is kept in a file.
· The file is analyzed and useful information is extracted both from the trace produced at run time and from the static form of the compiled program.
· Information about instruction grouping is collected in order to determine the instructions of the final architecture.
· The cost specifications, along with the information extracted from the execution of the benchmarks, are used to define the parameters of the architecture.
· Execution of the benchmarks on the defined architecture is finally simulated, and the performance is compared against that defined by the specifications. If the result is not satisfactory, the flow loops in the design phase until either a reasonable solution is obtained or the problem is found to have no solution (in the sense that the constraints imposed are too severe).
3.1.1. The infinite resource architecture
An important role in the analysis phase is played by the architecture used to extract the features of the application. We use an "infinite resource" architecture (IRA). This is because the architecture must influence as little as possible the
analysis. If we have an architecture with infinite hardware parallelism, an optimum scheduling algorithm and an operation window of infinite size, we can extract from the program, at run time, a certain number of features that do not depend on the physical structure of the architecture. The simulation of this architecture with the benchmarks gives us information about the maximum obtainable performance. This answers a primary question: what is the maximum clock cycle time allowed for performing the given task in the given time? And consequently: is the chosen technology the right one for the current job? If we also analyze the parallelism exploited in the simulation of the IRA, we can obtain a rough estimate of the cycle time required. As a first approximation, we can use folding techniques to deal with the extra parallelism not supported by our target architecture. With simple arithmetic we can therefore compute the total number of cycles needed to run a benchmark: $NC = NC_{IRA} \times AVF$, where $NC_{IRA}$ is the number of cycles on the IRA, and $AVF$ is the average number of foldings needed to deal with the limited amount of resources of the target architecture. This first-order estimate gives an idea of the potential of the technology and of the micro-architecture we are going to use to meet the specification of the design. In practice, it is not possible to implement a real IRA, but only a satisfactory approximation. We decided to emulate an IRA by imposing the following characteristics:
· The instructions coincide with the operations that drive the data path. This choice is dictated by the goal of grouping several operations to form a single instruction.
· The number of registers is 255. This guarantees, in most cases, that the compiler does not perform register spilling. This is intended to avoid as much as possible that the behavior of the benchmark depends on the architecture.
· The instructions of the architecture correspond to those defined by gcc. This is done to achieve the best performance from the compiler and to avoid any change in the behavior of the benchmarks.
· The level of branch nesting is 5, which should guarantee a good level of parallelism.
· The branch statistics are almost ideal, in the sense that we keep a very large branch table in which every branch is recorded.
· The scheduling mechanisms assume an instruction latency of 1 cycle.
· The size of the instruction window is 255.
· The processor has no cache. The whole program fits in memory. This assumption amounts to a virtual perfect cache. Memory latency is assumed to be always 1 cycle.
We decided to use the "gnu tools" because they are free and can be used by anyone. We thus have something like a standard that can be used to compare results from different research efforts.
3.1.2. The run time data
Data collected at run time are the source for any subsequent analysis and make it possible to discover the real parallelism in the application. A significant limit to the collection of run-time data is the huge amount of memory needed. If, for example, we store one byte for each instruction executed, we need at least 20 Gbyte of memory for some Specmark programs. This is impractical. Therefore, we decided to record only the information that cannot be extracted from the static program. We collect information about each branch executed by the program; in this way, together with the static program, we can reconstruct the flow of the program and obtain all the information about the execution except the operands of the instructions. We also collect information about which instructions are executed in parallel. At run time, we identify the groups of instructions that are executed in parallel and collect information about the percentage of time each
group is executed. In this way, we can define the instructions of the architecture we are going to design.
3.1.3. Features extraction
In order to explore the architectural space and find a suitable solution for the design, we need to analyze the data produced by the simulator and the static code. This is done by the feature extraction program. It receives two types of input: (1) the characteristics of the applications, (2) the constraints imposed on the architecture. The former are represented by the data collected at run time through simulation and from the binary code of the applications. The latter are represented by the type and number of allowed functional units, and by the number of registers and their organization (the number of banks, ports etc.). These data are organized in rows, each of which contains a feasible hardware combination. For instance, one row can have two adders, one parallel multiplier and a barrel shifter, while another can have two parallel multipliers and a single-bit shifter. These two rows represent two possible configurations that need more or less the same area. The same is provided for the registers. For example, we can have a single 64-register bank with four ports per register, or two 32-register banks with six ports. The complexity of the decoding logic is taken into account too. This is accomplished by defining which combinations of operations can be grouped to form one instruction. Each combination represents a certain functionality that needs some operands and sometimes requires scratch registers. The decoding cost does not take into account the number of scratch registers, since in the target architecture they are extracted from a look-up table, and the logic needed for a look-up table is the same for any number of registers associated with a given instruction. The number of scratch registers affects the dimension of the table, but the instructions that usually require scratch
registers are very few, and we decided that the dimension of the table can be considered negligible. From a VLSI implementation point of view, the most relevant factor in deciding whether to implement a certain instruction is the number of operands needed to specify the instruction. This is not an absolute factor: it depends on the number of registers, and therefore on the number of bits required for coding a register reference. It also depends on the instruction width and, of course, on the number of instructions. Therefore, we provide the program with a set of basic rules that simply refer to the complexity of a certain type of coding. At this stage of the research we do not aim at implementing a complex optimization algorithm, which could prove useful as a second-order step of the investigation. We want to investigate the feasibility of the approach we are pursuing, that is, the possibility of achieving good results. The program first collects information about the application:
- Statistics about the type of operations executed, needed for obtaining the following data:
· which functional units are needed in the application
· which bypasses among the functional units and between them and memory are needed
· what is the number of physical registers actually used
- Statistics about the intra-unit and inter-unit parallelism, needed for defining how many units of a certain type are needed and also for defining the number of ports for accessing memory.
- Statistics about the number of consecutive correctly predicted branches, i.e. the gain in parallelism of useful instructions that is achieved, in order to determine the number of branch nesting levels.
- Statistics about the distance, in number of memory locations, between two operations executed in the same cycle, in order to determine the size of the operation window.
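Some of the statistics listed above could be gathered along the following lines. The trace format is an assumption made for this sketch: one entry per IRA cycle, each entry listing the (address, unit type) pairs of the operations issued together in that cycle. The sizing rules used here (peak demand per unit type, largest address span within a cycle) are deliberately naive; the paper does not state the rules its feature extraction program applies.

```python
from collections import Counter

# Hypothetical trace: trace[c] lists the (address, unit_type) pairs of
# the operations issued together in cycle c of the IRA simulation.
trace = [
    [(100, "add"), (101, "mul"), (104, "add")],
    [(102, "load"), (103, "add")],
    [(105, "mul"), (112, "load"), (106, "add"), (107, "add")],
]


def operation_mix(trace):
    """How often each operation type is executed (which units are needed)."""
    return Counter(unit for cycle in trace for _, unit in cycle)


def peak_units(trace):
    """Largest number of same-type operations issued in a single cycle."""
    peak = Counter()
    for cycle in trace:
        for unit, n in Counter(unit for _, unit in cycle).items():
            peak[unit] = max(peak[unit], n)
    return dict(peak)


def window_span(trace):
    """Largest address distance between operations of the same cycle."""
    return max(
        max(addr for addr, _ in cycle) - min(addr for addr, _ in cycle)
        for cycle in trace if cycle
    )


print(operation_mix(trace))   # operation-type statistics
print(peak_units(trace))      # intra-unit parallelism, per unit type
print(window_span(trace))     # hint for the operation window size
```

Peak values give an upper bound on what the data path could use; a real sizing step would weigh them against the area rows of the constraint table, for instance by looking at how often each level of demand occurs rather than only at the maximum.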
On the basis of the above data, the program first defines the number of branch nesting levels, the number of memory ports and the size of the re-ordering window. On this basis, and with the inter- and intra-unit parallelism, it can roughly estimate the parallelism available under the introduced hardware constraints. The number of functional units and bypasses can therefore be computed. Once the micro-architecture has been defined, the program proceeds to define the instruction set architecture. This is achieved by balancing the need to group certain operations against the constraints given by the width of the instruction and the hardware complexity.
3.1.4. Architecture simulator
This program simulates the architecture as defined by the features extraction program. It is only intended as an instrument to verify the result produced, not as a tool for defining the architecture. This is due to the long time needed to simulate the benchmarks, which would lead to a prohibitive design time. The choice of the characteristics of the architecture is left to the features extraction program.
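Because full simulation is too slow to drive the design loop, a first-order estimate such as the folding relation of Section 3.1.1 (total cycles equal to the IRA cycle count multiplied by the average number of foldings) is the kind of calculation that can stand in for it. The sketch below is one possible reading of that relation, not the authors' tool. It assumes that an IRA cycle demanding more units of a type than the target provides is folded into ceil(demand / available) target cycles, and that AVF is the mean of these per-cycle factors. The input format, the function names and the two configurations are hypothetical.

```python
from math import ceil

# Hypothetical input: for each IRA cycle, the number of operations of
# each unit type that the infinite-resource machine issued in parallel.
# Every cycle is assumed to issue at least one operation.
ira_demand = [
    {"add": 2, "mul": 1},
    {"add": 1, "load": 1},
    {"add": 4, "mul": 1, "load": 2},
    {"mul": 3},
]


def folds(cycle, resources):
    """Target cycles needed to execute one IRA cycle (assumed folding rule)."""
    return max(ceil(n / resources[unit]) for unit, n in cycle.items())


def estimate_cycles(ira_demand, resources):
    """Return (NC_IRA, AVF, NC) with NC = NC_IRA * AVF."""
    nc_ira = len(ira_demand)
    avf = sum(folds(c, resources) for c in ira_demand) / nc_ira
    return nc_ira, avf, nc_ira * avf


# Two candidate rows of the hardware-constraint table (illustrative values).
config_a = {"add": 2, "mul": 1, "load": 1}
config_b = {"add": 1, "mul": 2, "load": 1}

print(estimate_cycles(ira_demand, config_a))
print(estimate_cycles(ira_demand, config_b))
```

Comparing the estimated cycle counts of candidate rows in this way is cheap enough to repeat inside the design loop, leaving the architecture simulator for the final check against the specifications.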
4. Concluding remarks
In this paper, we have proposed an approach to the design of microprocessors for embedded systems. The proposed methodology is currently in use for the in-house design of dedicated systems. We think that the requirement of a short time-to-market and the possibility of having a programmable microprocessor cell are driving factors for the future design of electronic systems. We have developed our methodology with this in view. We also think that the main area in which major research effort is needed is the development of methods for predicting the performance of a given architecture without resorting to extensive simulations. Simulation is the most time-consuming task; moreover, the collection of data demands a lot of disk space that might not be manageable and requires much time for an accurate analysis. We have sketched some baselines for performing an analysis that does not need extensive simulation, but the research is still at the first step. Another field of research concerns the verification of the microprocessor model, either embedded in the system or standalone, and for synthesis purposes. Usually three different models are used for these three steps, and this leads to possible errors in moving from one model to another. Formal methods for hardware verification could be an answer to this problem, but in this field too more work has to be done. In conclusion, research on microprocessors needs very broad expertise. Fundamental steps have to be taken to free the design of microprocessors from trial-and-error activities and to delineate a new approach that allows designers to design correct architectures in a straightforward manner.
References
[1] J.A. Fisher, S.M. Freudenberger, Predicting conditional branch directions from previous runs of a program, in: Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), 1992.
[2] M. Johnson, Superscalar Design, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[3] J. Fisher, Very long instruction word architecture and the ELI-512, in: Proc. of The 10th Annual International Symposium on Computer Architecture (ISCA10), 1983.
[4] J.A. Fisher, P. Faraboschi, G. Desoli, Custom-fit processors: letting applications define architectures, in: Proc. of The 29th Annual International Symposium on Microarchitecture (Micro-29), 1996.
Alessandro De Gloria received the Master's degree in Electronic Engineering from the University of Genova in 1980. In 1982 he received the specialization degree in Computer Science from the University of Genova. In 1983 he joined the VLSI Design Center of Genova University as a Research Scientist. His main research interests include VLSI design and parallel architectures. He has authored more than 50 papers in those research fields and has participated as session chairman at conferences focused on VLSI.