INTEGRATION, the VLSI journal 23 (1997) 95-111
An embedded CDMA-receiver: A design example

Jack P.F. Glas*

Delft University of Technology, Delft, Netherlands
Abstract

This paper describes the design of the receiver part of a communication system providing ad-hoc, random-access data-communication links. To enable such links we choose to apply an innovative CDMA communication technique based on two different spread spectrum schemes: direct-sequence and frequency-hopping. In this way we evade the interference problems usually existing in non-cellular systems. The main receiver tasks are removing the "spreading" from the incoming signal, acquiring synchronization of the receiver to the incoming signal, and recovering the data message. The operation of such a receiver can easily be captured in an algorithmic description; a software implementation would therefore be advantageous for cost and flexibility reasons. However, it is not yet possible to meet the hard timing constraints existing in the system with a software-only realization. Some time-critical parts of the receiver have to be implemented in hardware, resulting in a so-called embedded system. The key advantage of embedded system design is that the hardware and software parts are designed concurrently, enabling an almost seamless cooperation between the parts. To efficiently explore the large design space of possible hardware/software partitionings an automatic HW/SW partitioner is required. Such a partitioner should handle the imprecise cost data available at that stage to reduce the risk of ending up outside the specification. By manipulating the input data and by locking certain intermediate results the designer can still control the partitioning process.

Keywords: CDMA-receiver; Design example
1. Introduction

Modern communication systems rely heavily on advanced signal processing techniques like filtering and (Fourier) transforms. Implementing these functions in dedicated hardware (ASICs) usually provides the designer with fast implementations at the price of huge design and production costs. A dedicated signal processor can also yield rather fast results, at the cost of an extra processor. Besides the loss of speed, this option is less attractive as compilers for signal processors are hard to find. The least expensive method is to apply a general-purpose processor, but quite often speed requirements cannot be met in such cases.
* Correspondence address: Bell Labs, Lucent Technologies, 600-700 Mountain Avenue, Room 26-217, Murray Hill, NJ 07974, USA. Tel.: +1-908-582-7766; fax: +1-908-582-1239; e-mail: [email protected].
Yet another way of implementing these functions is as dedicated functional units in an application-specific processor. In this way, the dedicated hardware is embedded in a processor environment and can be integrated together with a processor on one chip. An additional advantage is that functions can be mapped into software or hardware, so that HW/SW partitioning can be used to find a cost-efficient design that meets the performance requirements.

In this paper, we describe the design trajectory of such an embedded system. We start with a system description of a communication system providing ad-hoc communication links with the possibility of random access. To this end an innovative CDMA (spread spectrum) scheme is applied; this technique is explained in Section 2. The second part of that section is devoted to the system architecture. Section 3 deals with the question why we would use an embedded approach to realize such a receiver. This section also provides an outline of the design process and gives a specification of what a partitioner should do. Section 4 first focuses on the translation of the system architecture via a C-description to an intermediate format which is used as a basis for the HW/SW partitioning process. Furthermore, it briefly describes how we derived the cost functions for the possible implementation alternatives. Section 5 starts with the intermediate format from the previous section, evaluates the partitioning process itself and discusses the results. A possible processor architecture is also given and an example of a realized functional unit is shown. Finally, we conclude with an evaluation of the design process.
2. System-level description
Before going into detail about the design process, the receiver concept should first be made clear. Understanding this concept is required to get an overview of the whole system and the design considerations that play a part in it. First the transmission technique will be explained; we motivate the choice of using a hybrid CDMA technique to transmit the data. Furthermore, we will briefly focus on the data-modulation technique. The second part deals with a possible system architecture.

2.1. CDMA transmission to realize ad-hoc communication links
The objective of our system is to enable ad-hoc communication links without the need for a complex infrastructure. From this consideration we automatically arrive at a system with a non-cellular structure. This concept, however, limits the number of usable system architectures. Frequency division multiple access (FDMA), commonly used in telephony systems, either needs a very wide frequency band (every user its own band) or introduces the need for a protocol which examines the frequency spectrum to select a non-occupied band. Also, the intended receiver should be notified on which band to listen. A second possibility is time division multiple access (TDMA), used, for instance, in GSM. This technique introduces the need for a central time reference and a protocol to assign time slots to different users. The latter is undesired as an infrastructure is needed to realize such a system. To overcome these problems it is also possible to apply code division multiple access (CDMA). Now the users do not have their own frequency band as in FDMA or their own time slot as in TDMA: in CDMA all users transmit in the same frequency band at the same time. To distinguish the messages, users have their own unique code. This code is in some way combined with the data signal
(spreading) to be transmitted. In the receiver the code is removed from the signal (despreading) to recover the transmitted data. As the codes are selected for their low cross-correlation properties, it is possible to receive just the signal that you are interested in. Other advantages of CDMA are its robustness against channel distortions (fading due to, for instance, multi-path), its ability to provide privacy (data detection is not possible without knowing the code) and its interference-limited operation. (The whole spectrum is used, even if only one user is active. In FDMA, the spectrum is under-utilized if only a couple of users are active.)

Combining a data signal with a code can be done in several ways. All techniques have one common property: the bandwidth of the initial signal increases (spreads). A CDMA technique commonly used in practice is direct sequence [1]. In this technique, a pseudo-random noise code sequence (PN-code) of length N_DS is multiplied with every data symbol. The spreading factor in such a system is equal to the code length. The advantage of this method is that the implementation is simple. A drawback, however, is that this technique is very sensitive to differences in received signal level from different transmitters: the signal level from the intended user should be equal to the signal level from the interfering users. If not, it is possible that the cross-correlation of the interfering code with the intended code gets higher than the autocorrelation of the intended code. In such cases correct detection is obviously not possible. This problem is called the "near-far effect". In practical systems, power-control protocols take care that all signals arrive at equal power level at a base station [2,3]. In non-cellular systems, there is no solution to the near-far effect in direct-sequence systems.

A technique that partly solves the near-far effect is frequency hopping, in which the transmitter "hops" in a certain sequence between a number (N_FH) of carrier frequencies. A drawback of this scheme is that it requires a frequency synthesizer capable of fast "hopping". Such a synthesizer is undesirable as it is hard to make and costly in terms of power consumption and area usage. As using only direct sequence suffers from the near-far effect and frequency hopping introduces higher costs, we propose to apply a combination of them [4; R1, p. 30]. In this way, we combine the advantages of both systems while reducing their shortcomings. Every data symbol is combined with a direct-sequence code while subsequent symbols are transmitted at different carrier frequencies. This concept is illustrated in Fig. 1.
Fig. 1. DS/FH spreading scheme.
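To make the hybrid spreading of Fig. 1 concrete, the following C fragment sketches how one symbol period could be prepared for transmission. It is an illustration only: the code length N_DS, the number of carriers N_FH and all identifiers are assumptions, and in this system the data symbol itself selects an MFSK tone (see Section 2.2) rather than being multiplied onto the chips.

    /* Sketch of hybrid DS/FH spreading for one symbol period.  The code
       length N_DS, the number of carriers N_FH and all names are
       assumptions made for this illustration. */
    #define N_DS 63                 /* assumed PN-code length (spreading factor) */
    #define N_FH 8                  /* assumed number of hop carriers            */

    struct spread_symbol {
        signed char chips[N_DS];    /* +1 / -1 chip sequence for this symbol     */
        int         carrier;        /* hop carrier used during this symbol       */
    };

    void spread_symbol_period(int symbol_index,
                              const signed char pn_code[N_DS],   /* +1 / -1 */
                              const int fh_sequence[N_FH],
                              struct spread_symbol *out)
    {
        /* frequency hopping: each symbol period uses the next carrier of
           the hop sequence (hop rate = symbol rate, as in Fig. 1)        */
        out->carrier = fh_sequence[symbol_index % N_FH];

        /* direct sequence: the symbol period is filled with N_DS chips of
           the PN code, spreading the bandwidth by a factor N_DS          */
        for (int n = 0; n < N_DS; n++)
            out->chips[n] = pn_code[n];
    }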
After choosing a spreading scheme, we have to make another choice: how to modulate the carrier with the data signal? To avoid the need for (hard) carrier-phase tracking and to lower the sensitivity to channel distortions [R1, p. 32] we chose to apply 16-ary (multiple) frequency shift keying (16-MFSK) [5, p. 295].
2.2. System architecture

The operation of the receiver is as follows: after reception, the signal enters the front-end, where it is filtered, amplified and down-converted. Next the FH-despreading takes place: mixing the signal with the FH-synthesizer output. After band-pass filtering the signal enters the A/D-converter where it is quantized and quadrature sampled. To enable accurate code tracking (synchronization) the sampling rate is asynchronous to the chip rate. The resulting I/Q-signals are despread (mixed with the prompt output of the PN-code generator).

Now that we have removed all the spreading from the received signal we have to detect which symbol was transmitted. As we selected a 16-MFSK modulation scheme, we have to examine 16 frequency bins for their power contents. The symbol corresponding to the frequency bin with the highest power contents is estimated to be the transmitted symbol. As we only consider the power contents, not the phase (which is not required for MFSK demodulation), we have a non-coherent receiver architecture. To perform the spectrum examination, we apply a simplified discrete Fourier transform (DFT). Applying a full DFT to the incoming signal would increase the computation requirements far above the level which is feasible to implement on a SoG chip. We overcame this problem by developing a simplified DFT algorithm, in the following referred to as DFT-correlation engine (DFT-CE). It uses two-leveled input signals and works internally with binary and three-leveled numbers [R1, p. 69; R2]. This approach enables MFSK demodulation on the SoG chip at the cost of a loss in signal-to-noise ratio of at most 1.06 dB.

The resulting spectral information is used by both the data detection unit and the synchronization unit. If no strong spectral component is found (relative to other components), it is assumed that the local PN-code and the received signal are not in lock. The result is a search for code acquisition: the local time reference is shifted along the received signal, to obtain synchronization at another relative shift. A second (smaller) DFT-CE, which transforms the early/late path signal, is used for code tracking (fine-tuning of the local time reference to the received signal). The algorithm applied for code tracking is based on the modified code tracking loop [6], adjusted for use in a non-coherent receiver architecture [R1, p. 95].

From Fig. 2 it is clear that the receiver basically consists of two parts. The first is an RF-section which deals with high-frequency signals (the RF carrier is in the 2.4 GHz ISM band); this section also includes the frequency-hop despreading stage. This front part of the system will be realized off-chip. The second part of the system will be realized on-chip. It consists of a signal-conditioning section which implements the hardware functionality of the system. In the figure, the components inside the dotted-line box can easily be captured in an algorithmic description. A complete software implementation of this part would be advantageous for low cost and flexibility reasons. In the next section we will see why such an implementation is not feasible.
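The following fragment sketches the non-coherent 16-MFSK detection step described above in the spirit of the simplified DFT-CE: two-leveled (+1/-1) I/Q samples are correlated with three-leveled (-1/0/+1) reference sequences per frequency bin, and the bin with the largest power wins. The array sizes, the table names and the assumption that the references are stored as precomputed tables are illustrative only, not the actual engine.

    /* Minimal sketch of non-coherent 16-MFSK detection with a simplified
       (three-leveled) DFT.  Names, sizes and the precomputed reference
       tables are assumptions for illustration. */
    #define N_BINS    16                /* 16-MFSK: one bin per symbol        */
    #define N_SAMPLES 260               /* samples per symbol period          */

    /* assumed three-leveled (-1/0/+1) cos/sin references per bin */
    extern const signed char tw_cos[N_BINS][N_SAMPLES];
    extern const signed char tw_sin[N_BINS][N_SAMPLES];

    int detect_symbol(const signed char i_in[N_SAMPLES],   /* +1 / -1 */
                      const signed char q_in[N_SAMPLES])   /* +1 / -1 */
    {
        long best_energy = -1;
        int  best_bin = 0;

        for (int b = 0; b < N_BINS; b++) {
            long re = 0, im = 0;

            /* correlate the despread I/Q samples with bin b's reference */
            for (int n = 0; n < N_SAMPLES; n++) {
                re += i_in[n] * tw_cos[b][n] + q_in[n] * tw_sin[b][n];
                im += q_in[n] * tw_cos[b][n] - i_in[n] * tw_sin[b][n];
            }

            /* non-coherent detection: only the power matters, not the phase */
            long energy = re * re + im * im;
            if (energy > best_energy) {
                best_energy = energy;
                best_bin = b;
            }
        }
        return best_bin;   /* estimated 16-MFSK symbol, 0..15 */
    }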
Fig. 2. Receiver architecture.
3. Designing a CDMA-receiver
From the previous section we know what the system architecture of the proposed CDMA-receiver looks like. This section deals with its implementation. First, we address the question why we build the receiver as an embedded system. Second, we focus on the design process itself. As the design process heavily depends on the target architecture, we also focus on the architecture choice. We conclude this section with our expectations towards the HW/SW-partitioning stage.

3.1. Why an embedded realization?
As stated before, a large part of the receiver operation can be written in terms of an algorithmic description. For this part a complete implementation in software would be advantageous: software implementations have the advantage of being cheap (no application-specific hardware) and flexible. At this moment, however, a software implementation will certainly not meet the hard timing constraints.

When talking about real-time systems like a receiver, one should distinguish systems in which hard timing constraints play a role from systems in which this is not the case. In the latter category one finds applications like laser printers, washing machines, etc.; timing only concerns performance, not functionality. In the category of real-time systems with hard timing constraints one finds applications like TV sets, automotive systems, radio receivers, etc. If timing constraints are not met, the system does not function properly. If, for instance, a TV set is not able to process one frame before the next frame starts, functionality is lost. Our application is an example of a real-time system with hard timing constraints. As in the TV-set example, processing has to be completed within a fixed frame, otherwise the system does not work correctly. So implementing the whole receiver in software is not an option. A logical solution is then to build a part of the receiver in hardware while another part of it stays in software. In this way we get an embedded system: an embedded system implements certain real-time functionality by using an optimal combination of dedicated hardware and software working together and concurrently.

The next step is to decide how embedded system design can be done efficiently. Traditionally, an embedded system was just a piece of hardware and one or more processing elements which worked
together to perform certain functionality. After the manual choice of what to do in hardware and what to do in software, different (groups of) people could start building their parts. Nowadays systems are getting more complex and efficiency requirements are getting tougher. By designing the hardware and software parts of the system "together", the boundary between those two parts gets lower. Now the interaction between hardware and software can be taken into account, and realizations of certain parts can be moved from hardware to software or the other way around. Usually, such a design starts with an algorithmic description in a high-level language. Then the designer is in control of choosing a processor, doing the hardware/software partitioning and designing the different parts [7]. In many situations an embedded system consists of a combination of a general-purpose processor and a co-processor [8], or of a standard programmable element together with an ASIC [9].

Yet another flavor of embedded system design lowers the boundary between hardware and software even further. This can be done by implementing a system in a single processor environment. The software functionality is implemented on a processor while it is also possible to include user-defined functionality in this processor architecture. It is clear that in this way the cooperation between hardware and software gets almost seamless. The question of hardware/software partitioning now becomes the question of "which functional units, and how many of them, should be included in the processor environment?". In such systems the main controlling unit is the processor; as a result this leads to a software-dominated embedded system.

With "designing an embedded system" we also mean that the hardware and software parts are designed concurrently. This approach leads to a system in which the hardware and software parts are optimally "tuned" to each other. Also, the question of what functionality to implement in hardware and what functionality to put in software should be answered. In our point of view the designer should be "advised" by automatic tools during the design. This approach has a number of advantages:
1. As a designer is usually either a hardware designer or a software designer, the resulting design is likely to be optimized either for hardware or for software and not for the complete system. A tool that gives a more objective view has advantages from this point of view.
2. As the size of the systems keeps growing, it is getting more and more difficult to have a single designer understand the whole system.
3. An automatic system is capable of giving fast feedback on different HW/SW partitionings. In this way the designer can quickly go through the large design space to get a "feeling" for the possibilities.
4. It would be advantageous if designs could be done by non-experienced designers.

3.2. The design process

Before starting the hardware/software partitioning stage, a processor environment first has to be chosen. This choice is essential as it determines the costs of software implementations. The choice is made by examining the boundary conditions and selecting from the set of available processors. The design process can be summarized as in Fig. 3. We start with a system implementation which is written down in terms of a parallel C-description. This description can be converted into a control and data flow graph. These graphs, in combination with constraints and data on
Fig. 3. Design process.
the implementation alternatives (costs representing latency and area usage), are the input to a first HW/SW partitioning run. During this (automatic) step the critical paths in the design are traced down. If the partitioning for some nodes is forced into certain alternatives because otherwise constraints are not met, these nodes are fixed. A complete partitioning solution is the result. After this first run an iterative and interactive stage starts. The designer can fix/free nodes (choosing alternatives), and can let the software optimize over the sets of remaining nodes. The designer decides when to stop the partitioning process. In this way the designer can go through the design space to find an optimal implementation. For this reason it is essential that the designer stays in control of the partitioning process. After accepting a partitioning, implementations for the chosen alternatives have to be found. This structural description of the hardware, together with the C-implementation of the software, is simulated. For this purpose a co-simulator is needed; PTOLEMY [10] has been extended to enable this kind of simulations.

3.2.1. Selecting a processor environment

We saw that lowering the boundary between software and hardware as much as possible leads to an efficient implementation in which hardware and software work together almost seamlessly. To obtain this situation we should find a processor environment which enables this kind of design.
Fig. 4. Structure of a MOVE processor.
Our starting point is a number of requirements:
1. It should be possible to choose the amount of "standard functionality" freely. Besides a minimum processor configuration, it should be possible to decide upon the number of functional units (FUs) like registers, integer units, multiplier units, etc.
2. It should be possible to implement "application-specific functionality" (ASP-FUs) within the processor architecture.
3. For the processor, the interface between the minimal processor configuration and FUs or ASP-FUs should be almost the same; for the controller they should look equivalent. However, if an FU has a connection with the outside world, a processor-hardware synchronization protocol is likely to be required.
4. It should be possible for different FUs or ASP-FUs to have different latencies.
5. Instruction parallelism should be possible. The main advantage of hardware realizations is that they enable parallel processing. If the processor architecture sequentially goes through a set of operations, this advantage is lost.

An architecture that fits these requirements is the MOVE architecture proposed by Corporaal and Mulder [11,12]. A schematic view of such an architecture is shown in Fig. 4. Typically, a MOVE processor is built around a set of busses: the MOVE bus. Different kinds of FUs can be connected to these busses via a so-called SOCKET. A SOCKET usually has three registers: an operand register, a trigger register and an output register. The operand and trigger registers are inputs. When data arrive at a trigger register, the FU connected to this socket starts its operation. After a certain latency the result appears at the output register. If we, for example, want to add two numbers a and b, we move a to the operand register of a socket connected to an adder FU. The second number b is moved to the trigger register. At that moment the addition starts and a number of cycles later (the latency) the result can be moved from the output register. A sketch of this transport-triggered style is given below.

The advantages of this architecture are its simplicity and flexibility: FUs can be added or removed easily. The flexibility, however, also creates a new problem: how to handle the large design space? We already mentioned that the number and kind of FUs can be chosen freely, but there are also other parameters to be chosen, such as: bus width, number of busses, address range and register-file size.
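The fragment below is a C model (not actual MOVE code) of the "add a and b" example: the socket register addresses and the way the latency is hidden are assumptions made for the sketch, whereas in a real MOVE program these would simply be transport (move) instructions scheduled by the compiler.

    /* Illustrative model of transport-triggered execution on an adder FU.
       The memory-mapped socket addresses are hypothetical. */
    #include <stdint.h>

    #define ADD_OPERAND  (*(volatile int32_t *)0x1000)  /* operand register */
    #define ADD_TRIGGER  (*(volatile int32_t *)0x1004)  /* trigger register */
    #define ADD_RESULT   (*(volatile int32_t *)0x1008)  /* output register  */

    static int32_t add_via_fu(int32_t a, int32_t b)
    {
        ADD_OPERAND = a;   /* move a to the operand register              */
        ADD_TRIGGER = b;   /* moving b to the trigger register starts the */
                           /* FU operation                                */

        /* after the FU latency (a fixed number of cycles, known to the
           scheduler) the sum can be moved from the output register       */
        return ADD_RESULT;
    }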
3.2.2. Expectations towards the hardware/software partitioning stage

In the previous section we saw that it is advantageous to have an automatic tool that guides the designer through the large design space of possible HW/SW-partitionings. To this end, tools are being developed; examples are COSYMA [8], VULCAN [9] and HSPART [13,14]. A difficult part of the design process is to supply cost data on the possible implementation alternatives. To compare the different partitionings, the automatic tool needs data on area occupancy and latency. As these data are not available at this stage, the designer will usually have a hard time collecting them. A safe way to do this is by designing the actual software and hardware implementations and then using profiles and simulations. This, however, introduces the need for extra manpower. In general, the data resulting from such exercises will contain uncertainties which in their turn introduce the risk of obtaining an inefficient or even invalid partitioning. So investing much effort in the extraction of cost data is no guarantee for obtaining an optimal result.

This problem can be overcome by applying HSPART, a tool built on top of the CASTLE environment [15]. The application of this tool avoids the requirement to supply exact cost data. The algorithm implemented in this partitioner uses imprecise (possibilistic) input data: supplying a "most-possible" value, a "minimum" value and a "maximum" value is sufficient. Besides reducing the required effort of the designer significantly, the partitioner takes timing constraints into account and reduces the risk of getting a design outside the specification.

To summarize, we expect the hardware/software partitioning stage to be guided by an automatic tool but controlled by the designer. It should give answers to the following questions:
1. Which functions are going to be mapped on standard functional units, and which functions on specific hardware?
2. How many of which kind of standard functional units are required?
3. What clustering should be applied to arrive at how many application-specific functional units?
4. Hardware/software partitioning stage
Here we will focus on the experiences during the hardware/software partitioning stage. First the C-description of the CDMA-receiver is dealt with; this also includes the derivation of the possibilistic cost data for the various implementation alternatives. The second part of this section is dedicated to the partitioning process itself.

4.1. Deriving the intermediate format
The first step is to describe the system. For this purpose we use the C-language enhanced with parallel constructs. Applying a number of CASTLE tools [15] in combination with STONE [16] results in a simplified SIR-graph. The nodes of this graph represent all functionality which could be put either in software or in hardware. Besides the C-description, which can be translated to a combined control-flow and data-flow graph, the cost data should also be provided. For every node one or more possible realizations (hardware or software) can be specified. For every realization a possibilistic estimate of the cost data should be supplied. To incorporate interfacing costs, data for the data-flow edges are required as well.
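As an illustration of this input, the declarations below sketch how nodes with possibilistic cost estimates and edges with interfacing costs could be represented. The struct and field names are assumptions made for this example, not the actual SIR or HSPART data structures.

    /* Hypothetical representation of the partitioning input: every node
       carries one or more implementation alternatives, each with
       possibilistic (triangular) cost estimates; edges carry interfacing
       costs.  All names are assumptions for illustration. */
    typedef struct {
        double most_possible;   /* x^m */
        double lower;           /* x^l */
        double upper;           /* x^u */
    } fuzzy_cost;

    typedef struct {
        enum { IMPL_HW, IMPL_SW } kind;
        fuzzy_cost area;        /* n/p transistor pairs (HW) or bytes (SW) */
        fuzzy_cost latency;     /* ns */
    } alternative;

    typedef struct {
        const char  *name;      /* e.g. "corr", "peakdetect"               */
        alternative *alts;      /* possible realizations of this function  */
        int          n_alts;
        int          fixed_alt; /* alternative locked by the designer, or -1 */
    } sir_node;

    typedef struct {
        sir_node  *from, *to;
        fuzzy_cost interface_cost;  /* HW-HW, SW-SW or mixed transfer cost */
    } sir_edge;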
The designer is able to influence the structure of the simplified SIR-graph in several ways. During the translation of C to SIR, only the function calls from the main function appear in the graph. By structuring the C-code in a certain way the designer decides which nodes appear in the graph and thus which nodes take part in the partitioning process. This possibility to influence the process also brings responsibility. If, for instance, too many function calls appear in the main function, too many nodes will appear in the SIR-graph. It then gets difficult to get an overview of the system by looking at the graph, and supplying cost data also becomes more problematic. Another way in which the designer can control the structure of the graph is by choosing how to deal with coarse-grain parallelism such as pipelining. The "timing schedule" will also influence the structure, for instance the choice of which functionality to perform in-line (triggered by the incoming data stream) and which functionality can be performed triggered by the system clock (off-line). Different structures can lead to different SIR-representations, which in their turn lead to different HW/SW partitionings [17].

Now we return to the proposed CDMA-receiver. The C-code representing the operation of this receiver comprises in total about 1700 lines of code. This C-description is converted to the intermediate format: a simplified SIR-graph, which is shown in Fig. 5. As the graph only represents the function calls from the main function with their dependencies, the translation from the C-description to the SIR-graph is not reversible. The "ovals" represent the functions while the "rectangles" represent the variables.

A brief description of the receiver process is the following (an illustrative code outline is given below). The function iTwiddle initializes the DFT-CE (data detection) and is called once when the receiver is switched on. Below this node we see basically two pipeline stages: one stage operates synchronously with the incoming signal (in-line; left side, starting with resetPNgen, phaseShift and resetAcc), the other stage runs off-line at the system clock. The left arm starts with another set of initializations executed once at the start of a new symbol period (resetPNgen, phaseShift and resetAcc). Below those operations we find a loop which is executed 260 times (260 samples make one symbol period). In there we find a prompt path in which despreading with the prompt code takes place; the DFT-CE (8 correlations in parallel) is also executed there. On the right side we find the early/late path, in which despreading with a shifted PN-code is performed. This path is required in the code-synchronization scheme. The outcome of this path is stored in a buffer (trBuf). The right arm implements the second pipeline stage; it processes the data from the previous symbol period. This stage starts from the results of the DFT-CE in the left arm. peakdetect selects the MFSK-channel with the highest energy contents and so decides which data symbol was transmitted. After that, another (smaller) DFT-CE operation starts to analyze the results of the early/late path. On the basis of these results the local time reference is shifted (via the variable trackcntr and the function phaseShift). Also a test is executed to check whether there is still synchronization (acqdetect).
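Purely to illustrate how the main function could be structured so that exactly these nodes appear in the SIR-graph, a hedged outline follows. Only the function and variable names are taken from the graph discussed above; their signatures, arguments and the helper declarations are invented for the sketch.

    /* Hypothetical outline of the receiver main loop.  Signatures and
       argument lists are assumptions; the names come from the SIR-graph. */
    typedef int sample_t;
    extern void iTwiddle(void), resetPNgen(void), resetAcc(void), acqdetect(void);
    extern void phaseShift(int), corr(sample_t), corrTr(sample_t);
    extern sample_t readSample(void);
    extern int  peakdetect(const long *), trPhDet(const long *);
    extern long *readRes(void);
    extern long trBuf[];
    static int trackcntr;

    int main(void)
    {
        iTwiddle();                       /* initialize DFT-CE, done once      */

        for (;;) {                        /* one iteration per symbol period   */
            /* in-line pipeline stage (synchronous to the incoming signal)     */
            resetPNgen();
            phaseShift(trackcntr);
            resetAcc();

            for (int n = 0; n < 260; n++) {        /* 260 samples per symbol   */
                sample_t s = readSample();
                corr(s);                  /* prompt path: 8 correlations       */
                corrTr(s);                /* early/late path, results in trBuf */
            }

            /* off-line pipeline stage (system clock, previous symbol period)  */
            int symbol = peakdetect(readRes());    /* detected data symbol     */
            (void)symbol;                          /* would go to the data sink */
            trackcntr = trPhDet(trBuf);   /* analyze early/late results        */
            acqdetect();                  /* still in lock?                    */
        }
    }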
4.2. Deriving the cost functions
The following step is to find the cost functions for the implementation alternatives of all nodes and edges. Data are provided in the form of triangular fuzzy numbers [14, Ch. 5]:

x = (x^m, x^l, x^u).
Fig. 5. Simplified SIR-graph of the CDMA-receiver.
Here x^m represents the most-possible value, x^l the lower-bound value and x^u the upper-bound value of the variable x. Although the introduction of possibilistic input data eases the required cost-estimation process considerably, deriving the cost data is still tedious work: in general, the hardware, software and interfacing costs of the functions under consideration are not known at this stage. If they were, the system would already exist! In the following, we concentrate on how the process of cost estimation is done.

As the cost data differ between technologies, we first describe our technology. For the hardware cost data we assume a semi-custom IC fabrication process which uses the fishbone image: a gate-isolation image in a 1.6 μm CMOS process with two-level metallization. The complete system (both hardware and software) should be realized on a single chip which contains 100 000 n/p transistor pairs. The sea-of-gates design system OCEAN [18] is used for prototyping. Concerning the software cost data, we assume a MOVE processor clocked at a speed of 41.6 MHz.

Let us make a distinction between data and functions. For data objects timing is not relevant in this context; the only time that plays a role is the interfacing time, which is specified separately. The estimate of the size of the data is rather easy. The hardware cost can be expressed as a cost per bit of storage, which is multiplied by the number of required memory elements. Uncertainties in this context are: do we need a reset? can we apply dynamic logic? how "clever" is the design? On the basis of experience, we define the cost of a single bit-storage element to be (in n/p transistor pairs):

C_area,flipflop = (16, 12, 21).
On the basis of this number the hardware cost data of all variables can be determined. The software cost data are expressed by crisp numbers equal to the number of bytes used.

The cost estimation of the functions is more complicated. The estimations can be based on previous designs, automatically generated designs or just experience. For the moment we have to rely on the designer's experience. The estimates used in this example are based on data from previous designs, adjusted for the changes introduced. The cost of software functionality is derived from a compilation and simulation of the code for the target architecture. To this end the code-generation software belonging to the MOVE environment is used [19]. The uncertainty here is the amount of parallelism possible. To derive proper cost values, we chose a small MOVE configuration (2 busses, 1 ALU, 1 multiplier and a load-store unit) and had the scheduler do its job. The sequential code sizes/latencies are used as maximum values, while the parallel code sizes/latencies are used as most-possible values. The minimum values are based on the scheduled numbers, adjusted for the possibility of a larger MOVE configuration and manual optimization of the assembly code.

The third set of cost data contains the interfacing costs, for which there are three possibilities:
1. hardware-to-hardware cost data are assumed to be cheap: once the signal is available it takes only a connection to pass it to another functional block;
2. software-to-software cost data include putting data on the bus and reading it from the bus (via sockets);
3. mixed software-to-hardware cost data also include reading or writing to the bus; before or after that operation the data is available in hardware.
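As a small worked illustration of how the area estimate of a data object could follow from the per-bit figure C_area,flipflop = (16, 12, 21), consider the sketch below. The fuzzy_cost type is repeated from the earlier sketch so the fragment is self-contained; the component-wise scaling rule is an assumption made for the example, not necessarily the exact arithmetic used by HSPART.

    /* Scaling a possibilistic per-bit storage cost to a multi-bit object. */
    typedef struct {
        double most_possible, lower, upper;
    } fuzzy_cost;

    static fuzzy_cost scale(fuzzy_cost c, double factor)
    {
        c.most_possible *= factor;
        c.lower         *= factor;
        c.upper         *= factor;
        return c;
    }

    /* per-bit storage cost in n/p transistor pairs, as given above */
    static const fuzzy_cost c_flipflop = { 16.0, 12.0, 21.0 };

    /* e.g. a 16-bit register costs (256, 192, 336) n/p transistor pairs */
    static fuzzy_cost register_area(int bits)
    {
        return scale(c_flipflop, (double)bits);
    }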
4.3. Timing constraints

For the receiver we define three path latencies: two in the in-line path (left arm) and one for the off-line path (right arm). In the in-line path we split the prompt path from the early/late path. All path latencies have a maximum value of 40 μs. This leaves 10 μs for the (SW) controlling tasks which do not appear in the graph (a complete symbol period takes 50 μs). By choosing these latencies we capture the three critical paths of the receiver in three different path latencies.
5. Partitioning results
HSPART is configured in such a way that it optimizes the used IC area under the condition of the timing constraints. The area we have available for the system is one SoG chip, which has about 100 000 n/p transistor pairs. On this chip we should put the HW functionality as well as the software functionality. The MOVE configuration used to obtain the profiling data uses about 60% of the chip area, which leaves 40%, or 40 000 transistor pairs, for the dedicated hardware. After a first partitioning run we decided to force the tracking buffer (trBuf) and the twiddle factors (data sequences used by the DFT-CE; twPrompt and twTrack) into external hardware. The reason for this is that the amount of data to be stored in these variables is too much to put on the chip. The result of the partitioning run can be seen in Fig. 6. The resulting cost data in possibilistic form are shown in Table 1. So for the chosen configuration the timing constraints are met, while the IC area is also reasonable (at most 30% of a SoG chip). It should be noted that these numbers are only indications; the real cost data are only known after implementation.

In the top left corner of Fig. 6 we see the three path latencies and the one available processor (MOVE). The dark blocks are advised to be put in hardware while for the light blocks a software implementation is chosen. What we see from this picture is that the in-line processing path is almost completely put in hardware. This is a sensible choice for the following reasons:
- The algorithm behind the correlators, which form the largest part of the in-line processing path, is optimized for efficient hardware realization. For example, the processing is done with two- and three-leveled signals instead of 16-bit signals. In hardware this saves space, while a software implementation would still use whole bytes.
- Some functionality in the system is clustered in hardware. Take the PN-code generator, for instance: if one of the functions dealing with this generator is put in hardware, the generator will be put on the chip. The other functionality of the generator is then available and will be put in hardware too.

Table 1
Final costs

HW chip-area            (20 963, 19 720, 29 087)    n/p transistor pairs
Prompt-path latency     (16 640,  8 728, 24 819)    ns
Track-path latency      (13 925, 10 748, 17 100)    ns
Off-line-path latency   (34 167, 28 350, 39 669)    ns
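The claim that the timing constraints are met can be checked directly against the upper-bound latencies of Table 1 and the 40 μs (40 000 ns) path budget of Section 4.3; the following fragment is only that worked check, not part of the design flow.

    /* Worked check: the Table 1 upper-bound latencies against the 40 us
       (40 000 ns) per-path budget of Section 4.3. */
    #include <stdio.h>

    int main(void)
    {
        const long budget_ns = 40000;
        const long upper_ns[3] = { 24819, 17100, 39669 };   /* Table 1 upper bounds */
        const char *name[3] = { "prompt", "track", "off-line" };

        for (int i = 0; i < 3; i++)
            printf("%s path: %ld ns <= %ld ns ? %s\n",
                   name[i], upper_ns[i], budget_ns,
                   upper_ns[i] <= budget_ns ? "yes" : "no");
        return 0;
    }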
Fig. 6. Result of the partitioning run.
Fig. 7. System floorplan of embedded receiver.
- Except for the correlation, the in-line path does not contain much processing. Simple operations on the input samples are executed; these operations can efficiently be mapped on hardware.
On the other hand, the off-line processing path contains more signal-processing tasks (peakdetect, trPhDet, etc.). Except for the PN-code generator, which was already in hardware, they were put in software. The computing-intensive correlation operation (corrTr) is also put in hardware, just like its equivalent in the in-line path. Following these results we add three application-specific functional units to the processor:
1. A DFT-CE which performs the correlation tasks and contains the related data. The functions to be implemented on such a unit are corr, corrTr, resetAcc and readRes, while the implemented data consist of the contents of the accumulators.
2. A PN-code generator which implements all the functionality related to the direct-sequence spreading: resetPNgen, shiftPNgen, getPrompt, getEL and the variables gen, promptcode and trackcode (an illustrative sketch of such a generator follows below).
3. The third application-specific functional unit is the frequency-hopping synthesizer (needed for the frequency-hop despreading). The functionality of this synthesizer did not appear in the graph as it was obvious that the high clock speed required in this circuit demands a hardware implementation.
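The following C model illustrates the interface of the PN-code generator unit listed above. Only the interface and variable names come from the paper; the LFSR length, the feedback taps and the single-chip-delayed early/late replica are assumptions made for the sketch.

    /* Illustrative model of the PN-code generator ASP-FU.  The 6-bit LFSR
       with taps x^6 + x^5 + 1 (maximal length 63) and the one-chip late
       replica are assumptions; only the names are taken from the SIR-graph. */
    #include <stdint.h>

    static uint8_t gen;          /* LFSR state ("gen" in the SIR-graph)     */
    static int     promptcode;   /* current prompt chip, +1 / -1            */
    static int     trackcode;    /* current early/late chip, +1 / -1        */
    static uint8_t delayed;      /* previous state, used for the late chip  */

    void resetPNgen(void)
    {
        gen = 0x3F;              /* assumed non-zero seed for the 6-bit LFSR */
        delayed = gen;
        promptcode = (gen & 1u) ? +1 : -1;
        trackcode  = promptcode;
    }

    void shiftPNgen(void)
    {
        delayed = gen;
        /* assumed maximal-length feedback taps for x^6 + x^5 + 1 */
        uint8_t fb = (uint8_t)(((gen >> 5) ^ (gen >> 4)) & 1u);
        gen = (uint8_t)(((gen << 1) | fb) & 0x3F);

        promptcode = (gen & 1u)     ? +1 : -1;
        trackcode  = (delayed & 1u) ? +1 : -1;   /* shifted (late) replica  */
    }

    int getPrompt(void) { return promptcode; }
    int getEL(void)     { return trackcode;  }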
Fig. 8. SoG-implementation of correlator FU.
Furthermore, the processor will contain "standard" functionality like a register file, a load-store unit, integer units and a multiplier. Separate hardware (not as functional units within the MOVE architecture) will implement a part of the in-line processing path: readSample, IQmpy, etc. A possible floorplan of the chip together with the off-chip memory is shown in Fig. 7. As an example of an FU, a sea-of-gates implementation of the DFT-CE is shown in Fig. 8. In the figure one can see the 8 prompt-path correlators (corr) and the one slightly different early/late-path correlator (corrTr) next to each other. The picture does not include the SOCKET to connect the DFT-CE to the MOVE bus.
6. Conclusions

In this paper we described a part of the design trajectory of the receiver part of an embedded CDMA communication system. The receiver structure itself is innovative in the sense that it combines two spread spectrum techniques (direct sequence and frequency hopping) to beat the interference problems (near-far effect) usually existing in non-cellular systems.

The design was done following embedded system design techniques: we start with a C-description of the system, and this description is converted into an intermediate format (simplified SIR-graph). This SIR-graph, together with imprecise estimates of the timing and area costs of the various implementation alternatives, is used as an input to the interactive HW/SW partitioning process. Considering this process we can draw the following conclusions:
- An automatic tool to guide the designer through the large design space of possible HW/SW-partitioning configurations is of great help to the designer.
- It is important that the designer stays in control of the partitioning process. There are basically two ways to influence this process: by manipulating the input data (writing the C-description) and by locking certain intermediate results.
- HSPART provides the possibility to supply imprecise instead of exact numbers for the cost data; this eases the cost-estimation process considerably. Also the risk of getting inefficient results or even results outside the specification is reduced.
For the described receiver the partitioning was made, together with a description of the final implementation. It also appeared that the target system is quite suitable for exploring the field of HW/SW-partitioning: it has a number of properties which are characteristic of modern communication systems, like real-time constraints, pipelined structures and digital signal processing modules.

References

[1] R.L. Pickholtz, D.L. Schilling, L.B. Milstein, Theory of spread spectrum communications - a tutorial, IEEE Trans. Commun. COM-30 (5) (1982) 855-884.
[2] K.S. Gilhousen, I.M. Jacobs, R. Padovani, A.J. Viterbi, L.A. Weaver, C.E. Wheatley, On the capacity of a cellular CDMA system, IEEE Trans. Vehicular Technol. 40 (2) (1991) 303-312.
[3] R. Prasad, M.G. Jansen, A. Kegel, Capacity analysis of a cellular direct sequence code division multiple access system with imperfect power control, IEICE Trans. Commun. E76-B (8) (1993) 894-904.
[4] J.P.F. Glas, On multiple access interference in a DS/FFH spread spectrum communication system, in: Proc. 3rd IEEE Internat. Symp. on Spread Spectrum Techniques and Applications, Oulu, Finland, July 1994, pp. 3-2.
[5] J.G. Proakis, Digital Communications, 2nd ed., McGraw-Hill, New York, 1989.
[6] R.A. Yost, R.W. Boyd, A modified PN code tracking loop: its performance analysis and comparative evaluation, IEEE Trans. Commun. COM-30 (5) (1982) 1027-1036.
[7] P. Stravers, Embedded system design, Ph.D. Thesis, Delft University of Technology, December 1994, ISBN 90-9007879-7.
[8] R. Ernst, J. Henkel, T. Benner, Hardware-software cosynthesis for microcontrollers, IEEE Des. Test Comput. (1993) 64-75.
[9] R.K. Gupta, C.N. Coelho Jr., G. De Micheli, Program implementation schemes for hardware-software systems, Computer (1994) 48-55.
[10] J.T. Buck, S. Ha, E.A. Lee, D.G. Messerschmitt, Ptolemy: a framework for simulating and prototyping heterogeneous systems, Int. J. Comput. Simulation (1994) 155-182.
[11] H. Corporaal, H.J.M. Mulder, MOVE: a framework for high-performance processor design, in: Proc. Supercomputing '91, Albuquerque, November 1991, pp. 692-701.
[12] H. Corporaal, Transport triggered architectures, design and evaluation, Ph.D. Thesis, Delft University of Technology, September 1995, ISBN 90-9008662-5.
[13] I. Karkowski, R.H.J.M. Otten, Uncertainties of hardware-software co-synthesis of embedded systems, in: Proc. Workshop on High Level Synthesis Algorithms, Tools and Design (HILES), Stanford University, November 1995.
[14] I. Karkowski, Performance driven synthesis of digital systems, Ph.D. Thesis, Delft University of Technology, December 1995, ISBN 90-5326-022-6.
[15] M. Theissinger, P. Stravers, H. Veit, CASTLE: an interactive environment for HW-SW co-design, in: Proc. Internat. Workshop on Hardware-Software Codesign, Grenoble, September 1994, pp. 203-209.
[16] P. Cappelletti, The STONE User's Manual, Delft University of Technology, 1995.
[17] J.P.F. Glas, Codesign in a CDMA-receiver, in: Proc. Workshop on High Level Synthesis Algorithms, Tools and Design (HILES), Stanford University, November 1995.
[18] P. Groeneveld, P. Stravers, OCEAN: The Sea-of-Gates Design System, Delft University of Technology, 1993.
[19] J. Hoogerbrugge, Code generation for transport triggered architectures, Ph.D. Thesis, Delft University of Technology, February 1996, ISBN 90-9009002-9.
[R1] J.P.F. Glas, Non-cellular Wireless Communication Systems, Ph.D. Thesis, Delft University of Technology, The Netherlands, December 1996, ISBN 90-5326-024-2.
[R2] L.K. Regenbogen, private communication.
[R3] J.M.G. Nieuwstad, Hardware-Software Cosimulation of Move Processors using Ptolemy. Master’s Thesis, Delft Univ. of Technology, The Netherlands, 1996.