Microprocessors and Microsystems xxx (2015) xxx–xxx
Contents lists available at ScienceDirect
Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro
A new digital front-end for flexible reception in software defined radio ☆
Isael Diaz a,⇑, Chenxin Zhang a, Lieven Hollevoet b, Jim Svensson c, Joachim Rodrigues a, Leif Wilhelmsson c, Thomas Olsson c, Liesbet Van der Perre b, Viktor Öwall a
a Lund University, Box 118, SE-22100 Lund, Sweden
b IMEC, Kapeldreef 75, B-3001 Heverlee, Belgium
c Ericsson Research, Scheelevägen 19, SE-22183 Lund, Sweden
Article info
Article history: Received 22 April 2014. Accepted 1 March 2015. Available online xxxx.
Keywords: Multi-standard DFE Concurrency LTE DVB-H WLAN
Abstract
Future mobile terminals are expected to support an ever increasing number of Radio Access Technologies (RAT) concurrently. This imposes a challenge to terminal designers already today. Software Defined Radio (SDR) solutions are a compelling alternative to address this issue in the digital baseband, given their high flexibility and low Non-Recurring Engineering (NRE) cost. However, the challenge still remains in the Digital Front-End (DFE), where many operations are too complex or energy hungry to be implemented as software instructions. Thus, new architectures are needed to feed the SDR digital baseband while keeping complexity and energy consumption at bay. In this article the architecture of a Digital Front-End Receiver (DFE-Rx) for the next-generation mobile terminals is presented. The flexibility needed for multi-standard support is demonstrated by detecting, synchronizing and reporting the carrier-frequency offset of multiple concurrent radio standards. Moreover, the proposed architecture has been fabricated in a 65 nm CMOS low power high-VT cell technology with a die size of 5 mm². The core module of the DFE-Rx, the synchronization engine, has been measured at 1.2 V and reports an average power consumption of 1.9 mW during Wireless Local Area Network (WLAN) reception and 1.6 mW during configuration, while running at 10 MHz.
© 2015 Elsevier B.V. All rights reserved.
☆ This work has been carried out under the MULTI-BASE project supported by the 7th Framework Programme (FP7) of the European Commission.
⇑ Corresponding author.
E-mail addresses: [email protected] (I. Diaz), [email protected] (C. Zhang), [email protected] (L. Hollevoet), [email protected] (J. Svensson), [email protected] (J. Rodrigues), [email protected] (L. Wilhelmsson), [email protected] (T. Olsson), [email protected] (L. Van der Perre), [email protected] (V. Öwall).

1. Introduction
Wireless communications is one of the fastest growing market segments. According to estimations by the International Telecommunication Union (ITU), there are today as many mobile subscribers as people living on this planet [1]. Moreover, each mobile terminal is expected to concurrently support a variety of Radio Access Technologies (RATs), as mobile terminals need to connect to specific services via different interfaces. Traditionally, individual chip-sets are customized per RAT and integrated in a system to provide the user with a seamless connectivity experience. However, as the number of standards rises, this solution becomes impractical and cost-ineffective, and imposes a high development risk. Furthermore, the multiple chip-set solution makes handover between radio technologies infeasible. It is well known that simultaneous support of multi-standard data reception using flexible hardware platforms is a great challenge. Although some early attempts have been presented in both academia and industry [2,3], experiments so far have been limited to the support of a single data stream. Switching between different standards is only possible through off-line configuration and is conducted by an external host controller. Although it has not been reported, configuration time during context switching is envisioned to be on a scale of hundreds of clock cycles, since the host controller has to be interrupted to load the appropriate programs/configurations before it is ready for new data reception. Evidently, this off-line switching approach is unacceptable from the user experience point of view, as terminals are temporarily "disconnected" every time they enter a new radio environment. Software Defined Radio (SDR) addresses this issue by abstracting hardware functionality into software oriented instructions [4,5]. In this manner, an SDR mobile terminal will be able to move from standard to standard, by dynamically adapting its internal
http://dx.doi.org/10.1016/j.micpro.2015.03.001 0141-9331/© 2015 Elsevier B.V. All rights reserved.
Please cite this article in press as: I. Diaz et al., A new digital front-end for flexible reception in software defined radio, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.03.001
structure in accordance with the instantaneous mobility and data rate required. However, there is still a gap to be filled in the Digital Front-End (DFE), where the high throughput makes SDR solutions hard to develop. A brief overview of the Digital Front-End Receiver (DFE-Rx) described here has already been introduced in [6]. The current article builds on the work presented in [6] and details the crucial blocks of the architecture, namely the resampler and the synchronization engine. Additional results are also included, such as performance plots and a better visualization of the hardware complexity.
The outline of the article is as follows. In Section 2 the multi-standard environment is described, together with the target application, i.e., synchronization during acquisition. In Section 3 all the blocks in the DFE-Rx are briefly described in order to explain their functionality and the inter-connection between the modules of the architecture. In Section 4 the most relevant blocks are described in more detail, exposing the real value of the architecture. In Section 5 the hardware costs are presented and analyzed. Section 6 presents the results obtained from simulations and measurements. Finally, conclusions are presented in Section 7.

2. Multi-standard environment in SDR
In the future wireless ubiquity, a mobile terminal will be subject to an uninterrupted multi-standard environment. In this scenario, the ability to adapt to the wide range of connectivity alternatives is essential.

2.1. Standard selection
In this study, however, it was decided to narrow down the number of standards to be supported. To motivate the final standard selection, consider Fig. 1, where some of the most popular standards are classified in relation to their data-rate and user-mobility.
In the bottom-left corner are the standards that provide the lowest data-rates under stationary conditions, which implies that they are considered low energy consumers. Different standards cover different areas in the plot and also place different requirements on the terminals supporting them. From the figure it can be seen that Orthogonal Frequency Division Multiplexing (OFDM)-based standards (i.e., Long Term Evolution (LTE), DAB, IEEE 802.11, IEEE 802.16, and DVB-H) are more dynamic, as they cover a larger portion of the plot, in addition to achieving the highest data-rate in the figure. Thus, supporting concurrent OFDM standards is a basic requirement in a reconfigurable DFE for SDR. Even though there is some research on reconfigurable hardware for multiple standards [8], to the best knowledge of the authors, no attempt has been made to focus on OFDM standards alone (taking advantage of their similarities), nor to consider concurrent support.
As a proof of concept, it was decided to focus on three OFDM standards that provide complementary services, namely, Wireless Local Area Network (WLAN) contributing high data-rate under stationary conditions, LTE for high mobility with moderate data-rate, and Digital Video Broadcasting-Handheld (DVB-H) providing multimedia broadcasting services. Since the selected standards are all OFDM-based, reuse of several hardware blocks is possible, while their differences make reuse non-trivial.

2.2. Target application
The DFE is the first stage after the Analog to Digital Converter (ADC) and before the baseband processor, where the first operation in the baseband is the Fast Fourier Transform (FFT). In this part of the terminal one of the most important operations is the synchronization process. The synchronization process is usually performed in the time and frequency domain, commonly referred to as the acquisition and tracking stage, respectively [9]. The acquisition stage aims to find the start of each OFDM symbol and to perform a rough estimation of the Carrier Frequency Offset (CFO). The tracking stage aims to refine the parameters obtained from the acquisition stage. This study focuses on the acquisition stage and assumes that the channel impulse response is shorter than the length of the Cyclic Prefix (CP).
Maximum Likelihood (ML) estimation [10] is commonly used to perform synchronization in OFDM systems. The algorithm is based on either pilots/preamble or the CP in OFDM symbols. In the three standards under analysis, the CP is present.
Fig. 1. Data rate transmission vs mobility for wireless standards extended from IMEC Scientific Report 2007 [7].

Besides, IEEE 802.11n contains a preamble, which has specific Short Training Symbols (STSs) designed for data detection and time synchronization [11]. Given that all STSs are identical, the first STS can be considered as the CP of the remaining part of the short training field. Based on either CP or preamble, the ML estimate can be expressed as

$$\hat{\theta} = \begin{cases} \arg\max_{n} \{ |c[n]| \} & \text{if } |c[n]| \geq T \\ 0 \ (\text{estimate not found}) & \text{otherwise,} \end{cases} \qquad (1)$$

with

$$c[n] = \sum_{k=0}^{L-1} r[n-k] \, r^{*}[n-M-k], \qquad (2)$$

where $r[k]$ is the received data vector, $c[n]$ is the output of the moving-sum, $\hat{\theta}$ indicates the estimated symbol start, and $(\cdot)^{*}$ denotes the complex conjugate operator. $T$ represents a threshold value, used to find the symbol start by detecting the position of the maximum correlation value. $T$ is adjusted according to the standard and the expected Signal-to-Noise Ratio (SNR). $L$ is the length of the moving-sum operation, and $M$ is the autocorrelation distance, i.e., the number of samples from the start of the CP to its corresponding copy within the OFDM symbol. The values of $L$ and $M$ vary among standards and also between different synchronization methods, i.e., CP-based for LTE and DVB-H and preamble-based for IEEE 802.11n. Table 1 summarizes the values of $L$, $M$ and the number of subcarriers $N_c$ for the three standards. In CP-based synchronization, the autocorrelation distance $M$ equals $N_c$, and the size of the moving-sum $L$ equals the length of the CP. LTE and DVB-H fall into this category. Since better synchronization accuracy is expected when using preambles, the preamble-based approach is used for IEEE 802.11n. In this case, $M$ corresponds to the size of an STS and $L$ equals the size of the remaining 9 STSs ($L = 16 \times 9$ in Table 1). This is equivalent to computing the correlation between neighboring STSs and accumulating the results over the entire short training field.

Table 1
Comparison of the length of the moving-sum L, the autocorrelation distance M, and the number of subcarriers N_c in the ML-based time synchronization.

        IEEE 802.11n    3GPP LTE    DVB-H
L       16 × 9          144         64
M       16              2048        8192, 4096, 2048
N_c     64              2048        8192, 4096, 2048

A common method to estimate the CFO is to divide the offset value into two components, expressed as

$$\Delta f_c = \alpha + \epsilon, \qquad (3)$$

where $\alpha$ and $\epsilon$ represent the integer and fractional part of the CFO, respectively. $\epsilon$ is normalized with respect to the sub-carrier spacing and is bounded by $|\epsilon| \leq 0.5$. This study focuses on the computation of the fractional CFO. An approach to estimate $\epsilon$ is based on a phase computation of the autocorrelation result at the estimated symbol start, $c[\hat{\theta}]$ [10], i.e.,

$$\epsilon = \frac{1}{2\pi} \arg \left\{ c[\hat{\theta}] \right\}. \qquad (4)$$
This is usually performed by a Coordinate Rotation Digital Computer (CORDIC) algorithm operating in circular vectoring mode [12].
A fundamental difference between standards that needs to be considered in the architecture is the sampling frequency. In the case of the three selected standards, the corresponding sampling frequencies are non-integer multiples of one another. This implies that the DFE-Rx must be flexible in order to be re-used across standards. When it comes to LTE and WLAN, there is also a major difference in that the data in WLAN is transmitted in bursts, and that the initial part of the burst contains signals useful for synchronization and channel estimation [13,14]. It is assumed that transceivers are relatively stationary, so that the parameters estimated at the beginning will be valid throughout the entire burst. For LTE, the situation is typically the opposite as far as channel estimation is concerned: the channel needs to be tracked continuously. In the case of DVB-H, there is a continuous transmission of content to the terminal, similar to LTE, plus scattered pilot patterns, with the main difference being that DVB-H does not support any multiple-input–multiple-output features [15].
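As an illustration of Eqs. (1), (2) and (4), the following minimal sketch computes the moving-sum autocorrelation, picks the correlation peak if it reaches the threshold, and derives the fractional CFO from the phase of the peak with a CORDIC in circular vectoring mode. It is a floating-point sketch under stated assumptions and not the fixed-point hardware of this article: the function names, the iteration count, and the use of `None` in place of 0 for "estimate not found" are choices made only for the example.

```python
import math


def moving_autocorrelation(r, L, M):
    """Eq. (2): c[n] = sum_{k=0}^{L-1} r[n-k] * conj(r[n-M-k]).

    Entries with n < L + M - 1 lack a full window and are left at zero.
    """
    c = [0j] * len(r)
    for n in range(L + M - 1, len(r)):
        c[n] = sum(r[n - k] * r[n - M - k].conjugate() for k in range(L))
    return c


def detect_peak(c, T):
    """Eq. (1): index of the largest |c[n]| if it reaches T, else None.

    Assumption: None replaces the value 0 that Eq. (1) uses for
    'estimate not found', so that index 0 stays unambiguous.
    """
    n_hat = max(range(len(c)), key=lambda n: abs(c[n]))
    return n_hat if abs(c[n_hat]) >= T else None


def cordic_angle(z, iterations=16):
    """Phase of z via CORDIC in circular vectoring mode (floating point)."""
    x, y = z.real, z.imag
    angle = 0.0
    if x < 0:  # pre-rotate by 180 degrees into the right half-plane
        angle = math.pi if y >= 0 else -math.pi
        x, y = -x, -y
    for i in range(iterations):
        t = 2.0 ** -i
        if y > 0:   # rotate clockwise, driving y towards zero
            x, y = x + y * t, y - x * t
            angle += math.atan(t)
        else:       # rotate counter-clockwise
            x, y = x - y * t, y + x * t
            angle -= math.atan(t)
    return angle


def fractional_cfo(c, n_hat):
    """Eq. (4): epsilon = arg(c[n_hat]) / (2*pi), in subcarrier spacings."""
    return cordic_angle(c[n_hat]) / (2.0 * math.pi)
```

For CP-based synchronization the sketch would be called with `L` equal to the CP length and `M` equal to the number of subcarriers; for the preamble-based case, with the values listed for IEEE 802.11n in Table 1.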
3. Architecture overview
In order to cope with the differences between the standards when it comes to synchronization, the architecture modules offer different levels of configuration. For example, control, filtering and down-sampling are very specific to each individual standard and need to be operational during the entire transmission, while the synchronization engine is required only during the first stage of connectivity. Thus, the proposed DFE-Rx needs a highly reconfigurable data-path per standard and a single synchronization engine linking each data-path. This architecture is illustrated in Fig. 2, where a maximum of two concurrent standards (LTE, DVB-H, or WLAN) is supported. However, since not 100% of the architecture is constantly used, more concurrent standards could be supported. The architectural building blocks are described in the following subsections.

3.1. Automatic gain control
The Automatic Gain and Resource Activity Controller (AGRAC) serves two major tasks. The first is to control certain front-end settings that require a short control loop, such as front-end gain and DC-offset compensation. The second task is to control which other parts of the DFE-Rx are active at a certain moment in time. Hence, it also controls the mode of operation of the DFE-Rx. Depending on the standard to be initiated, it will enable the filters, the receive data buffer and the synchronization engine. When the system has synchronized with the recently detected standard, the AGRAC generates the interrupt signal towards the baseband processor. The AGRAC is implemented as a Micro-Controller (µC), compatible with the Microchip PIC16F84A. Selecting an architecture that is compliant with an industry standard has the advantage that the existing tool-chains for application development and debugging can be extensively used.

3.2. Compensation
This block compensates for DC-offset and IQ imbalance. The amount of offset present in the incoming signals is estimated and communicated to the compensation module by the AGRAC. The compensation module corrects the error by adjusting the values of the in-phase and quadrature components accordingly.

3.3. Decimation chain
A multi-standard receiver must obtain the baseband signal at a sample rate given by the received standard. The master clock has a frequency unrelated to the standard's sample/symbol rate, and is sufficiently high to capture the signal content without aliasing. The decimation chain brings the signal to the correct sampling frequency in the digital domain. This is done by filtering and down-sampling to the baseband sample rate of the specific standard. By adjusting parameters inside the decimation chain, it is possible to digitally change the sample rate of the received signal to that required by each supported standard. This block is of great importance in the architecture and is explained in further detail in the following section.

3.4. Synchronization engine
A reconfigurable array is constructed from a mesh of heterogeneous resource cells, communicating through a combination of local interconnections with dedicated wires and a global hierarchical routing network. To meet the different computation demands of each standard's synchronization, the flexible hardware architecture allows the user to allocate different function blocks in the cell array, comprising both application-tuned and general-purpose processors and data memories. This module is considered the most important block in the entire DFE and is described in more detail in the coming section.

3.5. Buffer and bus interface
The DFE-Rx communicates with an SDR baseband processor through high speed communication interfaces. Thus, the DFE-Rx needs an advanced bus interface to be able to stream the data processed
Fig. 2. DFE-Rx top level.
by the DFE-Rx that is temporarily stored in the buffer. The need for this buffer arises from the fact that WLAN communicates in bursts. Hence, every sample contains relevant, unrecoverable information during the initial connection stages.

4. DFE-Rx's crucial modules in detail
The previous section presented an overview of the entire architecture. This section focuses on the two modules considered the most crucial in the DFE-Rx, namely the decimation chain and the synchronization engine. The flexibility of the architecture stems from the construction of these two modules, and thus they need further examination.

4.1. Decimation chain
The baseband signal must be translated from an arbitrary sample rate to the sample rate given by the received standard. The master clock has a frequency $f_1$ unrelated to the standard's sample/symbol rate $f_2$, and is sufficiently high to capture the signal content without aliasing.

4.1.1. Sample rate conversion
Sample rate conversion from $f_1$ to $f_2$ comprises calculating new samples spaced $T_2 = 1/f_2$ apart from old samples spaced $T_1 = 1/f_1$ apart. The new sampling instants thus fall somewhere in between the old sampling instants. The resampling problem is thus related to estimating the value of a sampled signal at arbitrary positions in between old sampling positions. For a single carrier system the actual sampling instants could be important, and this can also be handled by a Farrow filter. However, in an OFDM system it is usually sufficient to get the sample rate correct, and the actual sampling instants are less important. Henceforth, the relative position in between two consecutive old samples is given by μ in the range −0.5 < μ ≤ 0.5, where μ = −0.5 is the position of the earliest of the two samples, μ = 0.5 is the position of the latest of the two samples, and μ = 0 is the position in the middle of the two samples.
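To make the role of the relative position concrete (written `mu` in the code), the sketch below resamples a sequence with a small Farrow structure: one fixed-coefficient filter per polynomial order produces the polynomial coefficients, and the polynomial is then evaluated in `mu` with Horner's rule. It assumes a cubic Lagrange interpolator over four old samples, chosen only because its coefficients are easy to verify by hand; it is not the resampler design of this article, and all names are hypothetical.

```python
import math

# Farrow coefficients C[m][i]: tap weight i is sum_m C[m][i] * mu**m.
# Assumption: cubic Lagrange interpolation over four old samples located at
# -1.5, -0.5, +0.5 and +1.5 sample periods from the midpoint (mu = 0).
FARROW_COEFFS = [
    [-1 / 16, 9 / 16, 9 / 16, -1 / 16],   # mu**0
    [1 / 24, -9 / 8, 9 / 8, -1 / 24],     # mu**1
    [1 / 4, -1 / 4, -1 / 4, 1 / 4],       # mu**2
    [-1 / 6, 1 / 2, -1 / 2, 1 / 6],       # mu**3
]


def farrow_interpolate(taps, mu):
    """Value at relative position mu between taps[1] and taps[2]."""
    # One fixed-coefficient FIR per polynomial order.
    v = [sum(c * x for c, x in zip(row, taps)) for row in FARROW_COEFFS]
    # Horner evaluation of v[0] + v[1]*mu + v[2]*mu**2 + v[3]*mu**3.
    return ((v[3] * mu + v[2]) * mu + v[1]) * mu + v[0]


def resample(x, f1, f2):
    """Convert x from sample rate f1 to f2; edges without 4 taps are skipped."""
    step = f1 / f2                       # new spacing T2 in units of T1
    out = []
    k = 0
    while True:
        t = k * step                     # new sampling instant, in old samples
        late = math.ceil(t)              # index of the later old sample
        if late + 1 >= len(x):
            break
        if late >= 2:
            mu = 0.5 - (late - t)        # -0.5 < mu <= 0.5
            out.append(farrow_interpolate(x[late - 2:late + 2], mu))
        k += 1
    return out
```

Because only `mu` changes from one output sample to the next, the four coefficient filters stay fixed, which is what makes the structure attractive when the ratio between the two sample rates is not a simple fraction.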
For any fixed value of μ, a fixed coefficient filter can be designed that uses N old samples surrounding the sampling instant of the new sample. The only case considered is where N, i.e. the number of old samples used, is even and N/2 old samples on each side of the new sample are used.

4.1.2. Conversion performance
The performance of floating point Farrow filters has been evaluated both by theoretical calculations and by simulations. Figs. 3 and 4 are examples of performance evaluations where the output SNR for different relative input bandwidths has been determined both by calculation and by simulation. Based on these results, the selected Farrow resampler uses length N = 8 and polynomial order M = 3. The simulations show that the relative input bandwidth should preferably be less than around 0.7, and that if the noise is not suppressed outside the useful signal then around 0.4 is the minimum desired relative input bandwidth. For each standard the relative bandwidth can be calculated as the number of used carriers divided by the total number of subcarriers (the FFT size). The constraint on the input bandwidth to the resampler then gives the minimum input sample rate and, if noise is not suppressed, the maximum sample rate, as summarized in Table 2.

4.1.3. Decimation's architecture
The decimation chain block consists of programmable filters that remove unwanted signals and keep only the desired signal, and a Farrow resampler that can adjust the sample rate of the signal by an irrational factor [16]. This factor depends strongly on which standards are being received. The block diagram of the decimation chain with its main components is illustrated in Fig. 5. The flexible filter consists of a symmetric Finite Impulse Response (FIR) filter with 32 taps. The architecture is folded in order to reduce the number of multipliers to 4, and the coefficients are changed dynamically depending on the required standard configuration. The Farrow resampler achieves the sampling transformation by a combination of filter banks and polynomial evaluation, where polynomial and filter coefficients are set in relation to the current standard configuration.

4.2. Synchronization engine
The baseline architecture for implementing the multi-standard synchronization is based on a dynamically reconfigurable Coarse Grain Reconfigurable Architecture (CGRA) [17].
The reconfigurable array is constructed from a mesh of heterogeneous resource cells, communicating through a combination of local interconnections with dedicated wires and a global hierarchical routing network. To meet different computation demands, the flexible hardware architecture allows the user to allocate different function blocks in the cell array, comprising both application-tuned and general-purpose processors and data memories.

4.2.1. Algorithm's mapping and performance
The synchronization operations in the synchronization engine are partitioned into three main processing blocks: data correlation, peak detection, and CFO estimation. Different design parameters in the three radio standards set different hardware requirements. The size of the correlation FIFO,
Fig. 3. Calculated (line) and simulated (marker) performance for Farrow resampler for different sizes. Optimized for infinite SNR and relative input bandwidth 0.3.
Fig. 4. Calculated (line) and simulated (marker) performance for Farrow resampler for different sizes. Optimized for infinite SNR and relative input bandwidth 0.6.
Table 2
Summary of bandwidths and sampling rates.

                        Rel. out. BW    Fs2 [MHz]    Min Fs1 [MHz]    Max Fs1 [MHz]
LTE                     0.59            30.72        25.71            36
802.11n @40 MHz BW      0.89            40           50.893           71.25
DVB-H                   0.83            9.143        10.869           15.217
as an example, varies from 16 to 2048 samples for IEEE 802.11n and DVB-H 2K, respectively. Input samples are 12-bit complex numbers. To reduce memory requirements during the correlation computation, data samples in single-stream mode are truncated down to 4 bits. This relies on the assumption that synchronization in the acquisition stage only needs to be accurate enough for the refined estimation algorithms in the tracking stage to work properly [18]. During concurrent multi-stream processing, memories are shared between two data streams and the word-length of data samples is further reduced by half. As an example, a performance analysis of CFO estimation with respect to different input data truncation is shown in Fig. 6. The Mean Square Error (MSE) of the estimated frequency offset is simulated for an Additive White Gaussian Noise (AWGN) channel on an LTE transmission with a frequency offset of π/8. Although a higher data word-length attains higher processing accuracy, larger input truncation reduces both hardware complexity and memory size. For example, reducing the input word-length from 8 to 4 bits results in a performance degradation of around 0.66 × 10⁻³ radians at an SNR of 10 dB. With the LTE subcarrier spacing Δf = 15 kHz, this corresponds to a frequency error of 0.66 × 10⁻³/(2π) × Δf = 1.58 Hz. Further truncation to 2 bits results in 75% memory reduction at the cost of 87.26 Hz frequency error at the same SNR. Since a maximum frequency error of 2 kHz is tolerable by the receiver in LTE [19], this word-length reduction is justified. The same analysis is applied to the other radio standards, and the results show that quantization noise due to word-length reduction is negligible.

4.2.2. Synchronization engine architecture
Fig. 7 shows a block diagram of the reconfigurable cell array. The cell array is configured to have two processing and two memory cells. The processing cells are used to perform data operations, while the memory cells serve as correlation and moving-sum FIFOs as well as communication buffers between processors. The interface controller, connected to the cell array via hierarchical network
Fig. 5. Block diagram of the decimation chain consisting of a flexible filter and a Farrow resampler. The filter is a 32-tap symmetric FIR filter with configurable impulse response. The Farrow resampler contains a filter bank with configurable coefficients and dynamic polynomial evaluation.
Fig. 6. Analysis of input data truncation in CFO estimation for 3GPP LTE with a frequency offset of π/8.
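The word-length reduction analysed in Fig. 6 can be illustrated with a short sketch. It assumes that truncation keeps the most significant bits of a two's-complement sample (the rounding rule is not spelled out in the text) and converts a phase-estimation error into a frequency error using the LTE subcarrier spacing of 15 kHz. The helper names are hypothetical.

```python
import math


def truncate(sample, in_bits=12, out_bits=4):
    """Keep the out_bits most significant bits of a signed in_bits integer."""
    # Arithmetic right shift: rounds toward minus infinity.
    return sample >> (in_bits - out_bits)


def truncate_complex(i, q, in_bits=12, out_bits=4):
    """Truncate in-phase and quadrature components independently."""
    return truncate(i, in_bits, out_bits), truncate(q, in_bits, out_bits)


def phase_error_to_hz(error_rad, subcarrier_spacing_hz=15e3):
    """Frequency error corresponding to a phase-estimation error."""
    return error_rad / (2.0 * math.pi) * subcarrier_spacing_hz


def memory_saving(bits_before, bits_after):
    """Fraction of FIFO memory saved by storing narrower samples."""
    return 1.0 - bits_after / bits_before
```

With these helpers, `phase_error_to_hz(0.66e-3)` evaluates to about 1.58 Hz and `memory_saving(8, 2)` to 0.75, in line with the figures quoted in Section 4.2.1.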
Fig. 7. Block diagram of the 2 × 2 cell array and the interface controller deployed in the synchronization block of DFE-Rx. Solid and dashed lines depict local and hierarchical network interconnects, respectively.
interconnects, manages external data communication to other system blocks in the DFE-Rx and is responsible for static configurations of the resource cells (RCs).
Data flow processing. Concurrent processing calls for a processor design that satisfies different computational requirements on each individual data stream. Processing cells are Reduced Instruction Set Computing (RISC) cores with improved dataflow control. In addition to the functions available in a generic processing cell, the dataflow processor enhances data processing by supporting Single Instruction Multiple Data (SIMD)-like operations. The processor contains multiple processing lanes, illustrated in Fig. 8, capable of performing both complex- and real-valued operations. These operations are required by, for example, ML estimation and CORDIC computations, respectively, which are crucial for the algorithm used for synchronization. Taking a 16-bit 4-lane processor as an example, the processing lanes can be grouped into 2 or 4 computation paths, capable of executing 8-bit complex-valued or 16-bit real-valued operations. Fig. 9 depicts the detailed architecture of the arithmetic part of the Arithmetic Logic Unit (ALU) in the 16-bit processor. Basic operations of the ALU are controlled by two mode specifiers, "multiplication" and "vector". While the former switches between addition and multiplication mode, the latter selects real- or complex-valued operations. Real-valued output is obtained by concatenating the results from 'O3' and 'O4', while the real and imaginary parts of complex-valued output are taken from 'O1' and 'O4', respectively. In addition to the SIMD-like operations, computational units are extended to both the "instruction decode" and "write back" stages of the processor. As a result, several consecutive data manipulations can be accomplished in a single instruction execution without storing intermediate results. This substantially reduces register accesses. Moreover, each arithmetic- and logic-type instruction in the dataflow processor is extended to have two operation codes (opcodes), capable of performing two different operations on the same input data operands in each clock cycle. The widely used butterfly operation (simultaneous add and subtract [20]) in the FFT is a typical example of the dual-opcode instruction. This can be used to hide the execution time of data movement operations.
Memory cell. Memory descriptors are shared by the processing of multiple data streams. To cope with the different sample rates of the standards, it is crucial that memory descriptors can be executed in a non-sequential order. Otherwise, the data stream with the slowest sample rate will block the entire data processing. Thus, memory descriptors are extended so that they can be configured to execute either in non-blocking or blocking mode. In non-blocking mode, the operation controller of the memory cell sequentially starts a descriptor execution in each clock cycle, without waiting for a response from the data receiver regarding the last memory access. Therefore, subsequent descriptors can still be issued and executed even if the current one is blocked. Besides being used in multi-stream processing, the non-blocking execution mode is also useful when one memory cell is shared among several hosts (e.g. processing cells) operating at different stream transfer rates. In contrast, blocking execution mode guarantees the completion of each specified memory access before starting a new descriptor execution. This mode can be used to avoid mixing up stream transfers when an I/O port is shared among multiple memory descriptors. To further improve the flexibility of memory cells, the order of the descriptor execution is run-time programmable. This way, multiple descriptors can be arranged to reorder or repeat data sequences, or to cope with data streams that have different transfer rates. Fig. 10 illustrates the use of descriptor execution program
Fig. 8. Computation path of the dataflow processor. Configurations of the shaded blocks are stored in registers that are run-time accessible.
Fig. 9. Arithmetic part of the ALU in the dataflow processor, an example of the 16-bit case. Real-valued output is taken from 'O3' and 'O4', while complex-valued output is drawn from 'O1' and 'O4'.
during concurrent multi-stream processing. Assuming that the memory cell has four descriptors, which are configured to serve for two different streams in an interleaved manner, namely { and s for ‘‘stream 1’’ and r and t for ‘‘stream 2’’. During the processing of IEEE 802.11n and LTE, which both have an oversampling rate of 1, the four descriptors are executed sequentially. However, when dealing with IEEE 802.11n and DVB-H, execution sequence needs to be programmed in a way that data stream of IEEE 802.11n is processed four times before performing one DVB-H data reception. In addition to the flexible descriptor execution, the data access pattern of a memory cell can be reshaped by using a micro-block function. This enables memory access with finer word-length than a physical memory provides. For example, a 32-bit wide memory cell can be configurable to behave as two 16-bit wide or four 8bit wide memory cells. This feature is useful when supporting multi-standard data processing, as different standards may intrinsically require different processing word length. A micro-block operation is defined by a block size, stride, read and write pointer, and data mask. The block size specifies the word-length of a micro-block, used to determine the number of data accesses in each memory read and write operation. For a 32-bit memory cell, options of the micro-block size are 1, 2, 4, 8, 16, and 32 bits. Stride is the distance, measured in bits, to the next micro-block. Read/write pointers are physical memory addresses
and are automatically updated after each operation. Data mask enables bitwise operations on data that is read from or to be written to the memory.

5. Synthesis and verification

Fabricated in a 65 nm CMOS technology, the DFE-Rx has a die size of 5 mm² with 144 I/O pads. According to synthesis results, half of the chip area is taken by memories, logic cells and I/O pads, while the remaining half is used for power and signal routing as well as clock tree generation. Figs. 11 and 12 show the chip’s layout and its verification board, respectively. As a prototype, the design is pad-limited due to the large number of I/O ports required for individual tests of function blocks.

5.1. Area breakdown

Tables 3–5 show the area breakdown of the DFE-Rx. As can be seen, I/O pads occupy about 40% of the area and the remaining part is evenly distributed among the synchronization block and the two receiving data paths. In the following, we focus on the implementation of the synchronization block, namely the 2 × 2 cell array, and present silicon measurement results obtained from a standalone test.
Fig. 10. Illustration of descriptor execution program during concurrent multi-stream processing. (Panels: descriptor table; descriptor execution sequences for 802.11n with LTE and for 802.11n with DVB-H.)
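The descriptor execution and micro-block access described for the memory cell can be sketched in a few lines of Python. This is only an illustrative model, not the hardware implementation: the function names, the descriptor ids 1 to 4 and their assignment to streams, and the least-significant-first ordering of micro-blocks are all assumptions made for the example.

```python
from itertools import cycle


def descriptor_schedule(streams, repeats, length):
    """Build a descriptor execution sequence of the given length.

    streams: stream name -> descriptor ids owned by that stream (hypothetical ids)
    repeats: stream name -> consecutive executions of that stream per round
    """
    cursors = {name: cycle(ids) for name, ids in streams.items()}
    sequence = []
    while len(sequence) < length:
        for name in streams:  # dicts keep insertion order in Python 3.7+
            for _ in range(repeats[name]):
                sequence.append((name, next(cursors[name])))
    return sequence[:length]


def micro_blocks(word, block_size, stride=None, width=32):
    """Split one memory word into micro-blocks, least significant block first."""
    stride = block_size if stride is None else stride
    mask = (1 << block_size) - 1
    offsets = range(0, width - block_size + 1, stride)
    return [(word >> offset) & mask for offset in offsets]


# Assumed ownership: descriptors 1 and 3 serve stream 1, 2 and 4 serve stream 2.
streams = {"stream1": [1, 3], "stream2": [2, 4]}

# 802.11n with LTE: equal rates, so the four descriptors run in order.
equal = descriptor_schedule(streams, {"stream1": 1, "stream2": 1}, 4)

# 802.11n with DVB-H: four stream-1 executions per stream-2 execution.
skewed = descriptor_schedule(streams, {"stream1": 4, "stream2": 1}, 10)

print([descriptor for _, descriptor in equal])
print([name for name, _ in skewed])
print([hex(block) for block in micro_blocks(0xAABBCCDD, 8)])
```

Setting `stride` larger than `block_size` skips bits between micro-blocks, which is one plausible reading of the stride parameter; the data mask and the pointer updates are left out of the sketch.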
Table 3
Area occupied by a single DFE-Rx.

DFE-Rx                     Area [µm²]   Share [%]
Input/Output Pads             991,569       38.37
Synchronization Engine        479,026       18.54
DataPath Standard 1           501,894       19.42
DataPath Standard 2           501,894       19.42
Databridge                     50,676        1.96
Others                         59,175        2.29
Total                       2,584,234      100.00

Fig. 11. Final layout of the entire DFE-Rx, with 144 pads occupying a die area of 5 mm². (Dimension labels: 2.2 mm × 2.2 mm.)

Table 4
Area occupied by the Data Path for a single standard path excluding synchronization engine.

Single Data Path           Area [µm²]   Share [%]
AGRAC                         204,572       40.76
Compensation                    6,034        1.20
Resampler                      69,333       13.81
Buffer & Host Int             137,623       27.42
Debug Module                   75,774       15.10
Others                          8,557        1.70
Total                         501,894      100.00
Table 5
Area occupied by the synchronization engine based on CGRA for reception of one or two concurrent standards.

Synchronization engine     Area [µm²]   Share [%]
Router control                 18,968        3.96
Router data                    19,008        3.97
Memory Cell 0                  39,167        8.18
Memory Cell 0 RAM              66,016       13.78
Memory Cell 1                  39,169        8.18
Memory Cell 1 RAM              57,657       12.04
Internal Comm Buffer           25,640        5.35
Processing Cell 0              37,567        7.84
Processing Cell 0 RAM          37,009        7.73
Processing Cell 1              37,201        7.77
Processing Cell 1 RAM          41,318        8.63
Interface controller           60,320       12.59
Total                         479,040      100.00

Fig. 12. DFE-Rx system verification setup, connected to FPGA and computer via PCI interface.

According to synthesis results, 2.6 mm² of the design is used for memories, logic-related cells and pads, while the remaining area is used for power and signal routing. Looking into the area distribution of each data path, it can be seen from Table 4 that the largest share of the data path is occupied by the AGRAC, with a 40% share (internal memories included). This is no surprise given that this module is a functional µC with all the flexibility offered by its architecture; in addition, it requires a module for programming the µC. Buffers and the host interface have the second largest area share at this hierarchical level with 27%, due to the large memories acting as buffers. In addition, the host interface is capable of performing Direct Memory Access (DMA) operations to the baseband processor, which requires a relatively large number of resources. Table 5 shows the area breakdown of the synchronization engine, where most of the area is occupied by memories with a total share of 42%, i.e., 13.78%, 12.04%, 7.73%, and 8.63% for the
memories of memory cell 0, memory cell 1, processing cell 0, and processing cell 1, respectively. The second largest contributor is the interface controller, in charge of communicating with the remaining parts of the DFE-Rx, with a 12.59% area share. The area shares are visualized in Fig. 13, where all elements in the DFE are drawn to scale according to their area contribution. Note that the elements in Data-Path 1 are identical to those in Data-Path 2 and, for the sake of simplicity, are not shown.

5.2. Verification

Two verification strategies were developed to test the DFE-Rx as a fully functional system and to test the synchronization engine stand-alone. For these two strategies, two verification boards were
Fig. 13. DFE-Rx graphical visualization of area break-down.
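The memory share quoted for the synchronization engine can be checked directly from the Table 5 figures. The short sketch below recomputes the per-cell shares and the combined RAM share; the dictionary layout and variable names are illustrative choices, while the area values are copied from Table 5.

```python
# Cell areas in µm², copied from Table 5 (synchronization engine).
areas = {
    "Router control": 18968,
    "Router data": 19008,
    "Memory Cell 0": 39167,
    "Memory Cell 0 RAM": 66016,
    "Memory Cell 1": 39169,
    "Memory Cell 1 RAM": 57657,
    "Internal Comm Buffer": 25640,
    "Processing Cell 0": 37567,
    "Processing Cell 0 RAM": 37009,
    "Processing Cell 1": 37201,
    "Processing Cell 1 RAM": 41318,
    "Interface controller": 60320,
}

total = sum(areas.values())

# Percentage share of every block, rounded as in the table.
shares = {name: round(100 * area / total, 2) for name, area in areas.items()}

# Combined share of the four RAM blocks.
ram_area = sum(area for name, area in areas.items() if name.endswith("RAM"))
ram_share = 100 * ram_area / total

for name, share in shares.items():
    print(f"{name:<24}{share:6.2f} %")
print(f"{'All RAM blocks':<24}{ram_share:6.2f} %")
```

The four RAM blocks together account for roughly 42% of the synchronization engine, in line with the figure given in the text.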
Fig. 14. Synchronization engine stand-alone verification setup. Communication takes place over a serial interface at a lower operating frequency, and data is passed to the computer via a serial port.
Fig. 15. Measured power consumption of the cell array in a standalone test mode when processing an IEEE 802.11n data reception. (Plot of synchronization engine power [mW] versus time [samples]; annotations mark an idle interval and segments 1 to 4.)
created. Fig. 12 shows the system verification board, where the DFE-Rx is connected to a Field-Programmable Gate Array (FPGA) and further interfaced via another FPGA to a computer for data visualization. Furthermore, to verify the functionality of the cell array, a standalone test is carried out in the debugging mode of the DFE-Rx via an on-chip Serial DeBug (SDBG) interface. The SDBG interface contains a set of light-weight single-ended serial links capable of
operating at 10 Mbps when using ribbon cable connections. Higher speeds, up to 40 Mbps, may be achieved with good signal termination and PCB layout. Typical high-speed data-recovery circuits require a Phase-Locked Loop (PLL) module for each serial link to recover data (as well as clock) from an 8b/10b encoded serial link. However, data can be recovered from a serial link by simply using a 4× clock sampling scheme without the 8b/10b encoding. Additionally, instead of using a PLL, the presented data-recovery circuit employs two local clock signals: ‘clk’ and its 90-degree phase-shifted counterpart ‘clk90’. These clock signals can be generated from two independent local oscillators or clock generation circuits. However, it is important to maintain the phase relationship of the two clocks, and serial links are implemented with single-ended I/O pads. Both clock signals (‘clk’ and ‘clk90’) are provided from normal clock pads and are directly used inside the chip without further phase adjustment.

Table 6
Area occupied by synchronization engine made out of hardware accelerators for a single standard reception.

Synch accelerator          Area [µm²]   Share [%]
Correlation FIFO               27,232       44.69
Config module                      42        0.07
Complex multiplier              1,296        2.13
Dual moving sum                25,258       41.45
Peak detector                   2,305        3.78
Others                          1,479        2.43
CORDIC time-multiplexed         3,330        5.46
Total                          60,942         100
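The PLL-free data recovery described for the SDBG links can be illustrated with a generic oversampling model. The sketch below is not the circuit used on the chip: it assumes that the rising and falling edges of ‘clk’ and ‘clk90’ yield four sampling phases per bit, uses an ideal noise-free NRZ waveform, and picks the sampling phase by counting where transitions occur. All function names are hypothetical.

```python
SAMPLES_PER_BIT = 4  # assumed: one sample per clock phase (0, 90, 180, 270 degrees)


def oversample(bits, skew):
    """Sample an ideal NRZ stream four times per bit, starting `skew` samples late."""
    wave = [bit for bit in bits for _ in range(SAMPLES_PER_BIT)]
    return wave[skew:]


def recover(samples):
    """Recover bits by sampling about half a bit after the most common edge phase."""
    edges = [0] * SAMPLES_PER_BIT
    for i in range(1, len(samples)):
        if samples[i] != samples[i - 1]:
            edges[i % SAMPLES_PER_BIT] += 1  # a bit boundary lies just before phase i
    edge_phase = edges.index(max(edges))
    pick = (edge_phase + SAMPLES_PER_BIT // 2) % SAMPLES_PER_BIT
    return samples[pick::SAMPLES_PER_BIT]


bits = [1, 0, 1, 1, 0, 0, 1, 0]
for skew in range(SAMPLES_PER_BIT):
    print(skew, recover(oversample(bits, skew)))
```

For a skew of three samples only one sample of the first bit is captured, so the model drops that partial bit and recovers the remaining ones. A real receiver would also have to cope with jitter and long runs without transitions, which this sketch ignores.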
Fig. 16. Area overhead of the synchronization engine solution over the accelerator solution, shown for the synchronization engine only and for the full DFE-Rx.
The DFE-Rx SDBG consists of three serial links: ASIC Control Input Link (AIL-C), ASIC Data Input Link (AIL-D), and ASIC Output Link (AOL). The AIL-C and AIL-D are used to stream configurations and data into the cell array, respectively, while the AOL is shared for both data and control outputs. Through the SDBG interface, the cell array is connected to an FPGA platform, Xilinx XUPV5-LX110T, which implements the control and data streaming logic for communicating with the cell array. Fig. 14 illustrates the setup and the measurement testbed for the standalone test, where the synchronization engine is accessed directly, bypassing the other elements inside the DFE-Rx, and a stand-alone test is performed via a serial interface running at 10 MHz.

6. Results

Fig. 15 shows the power consumption of the cell array measured while processing an IEEE 802.11n data stream at a nominal supply voltage of 1.2 V and a clock frequency of 10 MHz. During the reception of 802.11n data frames, the measured minimum and maximum power consumption is 1.75 mW and 2.19 mW, respectively. During the loading of hardware configurations (not shown in Fig. 15), the cell array consumes 1.95 mW.

In order to assess the hardware complexity imposed by the DFE, an equivalent accelerator solution for synchronization was synthesized. The high degree of reconfigurability evidently increases complexity. To measure this complexity overhead, the synchronization engine has been synthesized in a 65 nm CMOS process and compared to an equivalent engine built only from hardware accelerators. The synthesis results of the accelerator-based synchronization are shown in Table 6. When examining Tables 5 and 6, a module-by-module comparison is not possible, since the CGRA has multiple functionality.
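The technology-scaling figures of Table 7 can be reproduced with a small calculation. The sketch below starts from the synthesized 65 nm maximum frequency and applies the table's delay scaling rule (delay proportional to 1/s, so frequency proportional to 65/node). Two inputs are assumptions that are not stated in the article: a cost of four clock cycles per input sample, chosen because it matches the published table, and the customary baseband sample rates of each standard (30.72 MHz for 20 MHz LTE, 20 and 40 MHz for WLAN, and 64/7 MHz for DVB-H in an 8 MHz channel).

```python
from fractions import Fraction

F_MAX_65NM_MHZ = 534   # synthesized maximum frequency at 65 nm (Table 7)
CYCLES_PER_SAMPLE = 4  # assumption: not given in the text, chosen to match Table 7

# Assumed baseband sample rates in MHz.
SAMPLE_RATE_MHZ = {
    "LTE @20 MHz BW": Fraction(3072, 100),
    "WLAN @20 MHz BW": Fraction(20),
    "WLAN @40 MHz BW": Fraction(40),
    "DVB-H 2K mode": Fraction(64, 7),  # 8 MHz channel assumed
}


def scaled_fmax(node_nm):
    """Scale the 65 nm frequency linearly with feature size, rounding down."""
    return F_MAX_65NM_MHZ * 65 // node_nm


def concurrent_standards(node_nm):
    """Number of streams of each standard that fit into the scaled clock rate."""
    f_max = scaled_fmax(node_nm)
    return {
        name: int(f_max // (CYCLES_PER_SAMPLE * rate))
        for name, rate in SAMPLE_RATE_MHZ.items()
    }


for node in (130, 90, 65, 40, 28):
    print(node, scaled_fmax(node), list(concurrent_standards(node).values()))
```

Under these assumptions every entry of Table 7 is recovered, which suggests that the table was derived from a simple clock-budget argument of this kind; the actual per-standard requirements used by the authors may differ.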
Table 7
Sync engine maximum speeds after synthesis and the number of standards it could support when running at that speed.

Tech [nm]   CA max freq [MHz]   LTE @20 MHz BW   WLAN @20 MHz BW   WLAN @40 MHz BW   DVB-H 2K mode
130                  267*                    2                 3                 1               7
90                   385*                    3                 4                 2              10
65                   534                     4                 6                 3              14
40                   867*                    7                10                 5              23
28                  1239*                   10                15                 7              33

* Scaled from 65 nm with scaling rules: A ∝ 1/s² and T ∝ 1/s.

However, it is evident that both designs are memory dominant, i.e., the correlation FIFO and dual moving sum in the case of the
accelerator and all the Random Access Memory (RAM) elements in the CGRA solution. In terms of functionality, two accelerator solutions are equivalent to a single CGRA solution. It can then be seen that, for reception of two concurrent standards, the area overhead is a factor of 2.7, i.e., the CGRA is almost 3 times larger than its accelerator counterpart. However, this overhead is reduced to only 1.2 when considering the entire DFE-Rx. Fig. 16 illustrates the hypothetical scenario where the number of concurrent standards needed in a specific terminal ranges up to a total of 32. Even though this scenario is impractical, it serves as a reference to find the point where the overhead is no longer too costly; this point is reached when a minimum of 12 concurrent standards are needed. It can also be seen from Fig. 16 that, from the system perspective, the overhead does not vary much as the number of concurrent standards increases, since the area-dominant parts of the DFE-Rx have to be duplicated per standard, i.e., all the elements in Table 4.

Moreover, the CGRA solution can be more attractive when considering technology scaling. This is depicted in Table 7, where the maximum speed extracted from synthesis has been scaled to various CMOS technologies. Hypothetically assuming that the fabricated synchronization engine can be clocked at this maximum speed, the number of concurrent standards that the synchronization engine can support (given the requirements of each standard) is shown in the table. Even though an application with 33 concurrent DVB-H receiver signals is hard to imagine, the scalability potential provided by the CGRA is evident.

The stand-alone verification of the synchronization engine provides a clear indication of the concept’s feasibility. Measurements were performed on the synchronization engine receiving the WLAN standard with a core supply of 1.2 V. The measurements are depicted in Fig. 15, where four consecutive WLAN patterns are input to the synchronization engine.

7. Conclusions

From the analysis of the different algorithms and hardware implementations explored in this article, it can be concluded that future SDR terminals will require a non-traditional design strategy for the digital front-end. Algorithm-architecture co-design plays a critical role in the design of future terminals: given the level of flexibility required by the mobiles of the future, typical hardware accelerators, DSPs or general-purpose processors are not sufficient as stand-alone solutions. When a mobile terminal has to support multiple streams concurrently, the critical point becomes an architecture that has good flexibility with limited overhead, such as the proposed DFE-Rx. Hardware complexity was explored by comparing the proposed architecture to an equivalent accelerator. Even though the CGRA synchronization engine results in an overhead, its benefits in flexibility and scalability are evident. As the number of concurrent standards increases, this overhead becomes negligible, more so when the entire system is put into perspective. Moreover, verification of the synchronization engine, the main architectural element, was performed, and measurements reported an average power dissipation of 1.9 mW when receiving the WLAN standard, measured at 1.2 V.
Isael Diaz was born in Acapulco, Mexico, in 1980. He received his Bachelor’s degree in Electronic Systems Engineering from Tecnológico de Monterrey, Mexico, in 2003, and his Master’s and PhD degrees from Lund University in 2007 and 2014, respectively. His main research interests are the implementation of algorithms for baseband processing with wireless multi-standard support.
Chenxin Zhang (S’09) received the M.Sc. and Ph.D. degrees in electrical engineering from the Department of Electrical and Information Technology, Lund University, Sweden, in 2009 and 2014, respectively. During October 2012 to February 2013, he was a visiting scholar at the Department of Electrical Engineering, University of California, Los Angeles. His research mainly focuses on developments of reconfigurable architectures for high computing performance and runtime flexible task mappings.
Lieven Hollevoet received the engineering degree in electronics from the Polytechnic University in Ostend, Belgium, in 2000. That year he joined imec as a member of the CSI division, where he took care of implementation and transfer of knowledge from imec to SMEs. In 2003 he joined the wireless group at imec, where he worked on the definition, implementation and testing of software defined radio solutions. In this function he was technically responsible for the team that realized the world’s first low-power digital front-end for sensing ASIC. In 2014 he became co-owner of the engineering company Quicksand.
Jim Svensson was born in Sweden in 1979. He received the M.Sc. degree in electrical engineering from Lund University, Sweden, in 2003. He has been employed at Ericsson AB, Sweden, since 2003, working with Ericsson Research from 2005 to 2014 and currently with product development. He has worked in areas relating to OFDM systems, signal processing, and analog and digital implementation.
Joachim Neves Rodrigues (S’00–M’05–SM’11) received the Ph.D. degree in circuit design from Lund University, Lund, Sweden, in 2005. Currently, he is an Associate Professor in the Department of Electrical and Information Technology, Lund University. From 2005 to 2008, he acted as ASIC process lead in the Digital ASIC Department at Ericsson Mobile Platforms, Lund, Sweden. He rejoined his current department in 2008, and is currently the Program Director for System-on-Chip. His main research interests are modeling and implementation of digital and mixed-mode microelectronics and architectures for high-performance ultra-low voltage designs, with a focus on biomedical circuits and systems. Dr. Rodrigues has been a technical committee member of the Biomedical Circuits and Systems Society since 2010, and is Chair of the Swedish SSC chapter.
Leif Wilhelmsson (SM IEEE) received the M.S. in Electrical Engineering and the Ph.D. in Telecommunication Theory from Lund University, Lund, Sweden in 1992 and 1998, respectively. Since 1998 he has been with Ericsson Research in Lund. His research interests include digital communication, short range wireless systems, error correcting coding, multi-standard coexistence, and practical aspects of digital and analog implementations. He is the named inventor of 55 granted US patents and in 2007 he received the Ericsson award ‘‘Inventor of the year’’.
Thomas Olsson received his PhD in Digital ASIC design from Lund University, Sweden, in 2004. In 2004, he joined Ericsson AB, where he held a position as Senior Researcher until December 2014. Since January 2015 he has been heading a section within Ericsson Research. At Ericsson Research he works mainly with circuit design and system implementation. From 2008 to 2011 he took part in the European FP-7 research project MultiBase.
Liesbet Van der Perre received the M.Sc. degree in Electrical Engineering from the K.U. Leuven, Belgium, in 1992. The research for her thesis was completed at the Ecole Nationale Superieure de Telecommunications in Paris. She graduated summa cum laude with a PhD degree in electrical engineering from the same university in 1997. Her past work focused on radio propagation modelling, system design and digital modems for high-speed wireless communications. She has held positions as project manager, program manager, and program director for imec’s broadband wireless R&D. Currently she is a director at imec Academy, with the mission to offer learning for excellence for the imec community and partners. Prof. Dr. Ir. Van der Perre is a professor at the K.U. Leuven. She is an author and coauthor of over 250 scientific publications published in conference proceedings, journals, and books. She was appointed honorary doctor at the Faculty of Engineering LTH, Lund University, in 2015.
Viktor Öwall (M’90) received the M.Sc. and Ph.D. degrees in electrical engineering from Lund University, Lund, Sweden, in 1988 and 1994, respectively. From 1995 to 1996, he was a Postdoc at the Electrical Engineering Department of the University of California at Los Angeles, where he mainly worked in the field of multimedia simulations. Since 1996, he has been with the Department of Electrical and Information Technology, Lund University, Lund, Sweden. He is currently full Professor and, since 2015, the Dean of the Faculty of Engineering. He was the founder and Director of the VINNOVA Industrial Excellence Center in System Design on Silicon (SoS), which he headed until 2014. His main research interest is in the field of digital hardware implementation, especially algorithms and architectures for wireless communication and biomedical applications. Research projects include combining theoretical research with hardware implementation aspects in the areas of wireless communication, video processing, and digital holography. Dr. Öwall was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: ANALOG AND DIGITAL SIGNAL PROCESSING from 2000 to 2002 and of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS from 2007 to 2009.