Digital Signal Processing 20 (2010) 263–268

Digital Signal Processing www.elsevier.com/locate/dsp

1.672 Gigabits/s FPGA based Viterbi decoder for 4D 8PSK TCM

A.T.M. Anishur Rahman *, W.G. Cowley

Institute for Telecommunications Research, University of South Australia, Adelaide, Australia

Article info

Article history: Available online 6 June 2009

Keywords: Trellis Coded Modulation (TCM); Spectral efficiency; Parallel Transition Decoding Trellis (PTDT); Earth Exploration Satellite (EES); Field Programmable Gate Arrays (FPGA)

Abstract

In this paper a novel Viterbi decoder architecture for the 4D 8PSK TCM scheme is presented and analyzed. The design includes a new three-RAM based traceback unit. With some modification, this architecture can be used for the different spectral efficiencies of the 4D 8PSK TCM schemes defined in the Consultative Committee for Space Data Systems (CCSDS) standard for Earth Exploration Satellites (EES). The current design supports data rates up to 1672 MBit/s. The required hardware resources are shown, and simulation results and measurements from the real-time decoder are discussed. © 2009 Elsevier Inc. All rights reserved.

1. Introduction

High data rate satellite communication is a requirement for EES, which transmit data at tens of megabits per second. Reliable communication is a further requirement of EES, as of almost all modern communication systems, and can be achieved by means of channel coding. High data rate transmission and channel coding both demand more bandwidth, a scarce resource that must be used as sparingly as possible, so any technique capable of saving bandwidth is very important. The CCSDS standard for EES [1] is useful in this respect. It recommends 4D 8PSK TCM as the channel coding/modulation technique, which provides enhanced reliability without increasing signal bandwidth. The standard defines four varieties of 4D 8PSK TCM, with spectral efficiencies of 2, 2.25, 2.50 and 2.75 Bits/symbol. Irrespective of the spectral efficiency, the embedded convolutional encoder is a rate 3/4 encoder of constraint length 7 (sixty four states). The Viterbi decoder architecture proposed in this paper is customized for 2.75 Bits/symbol, and the corresponding encoder and mapper are shown in Fig. 1. The other options of the CCSDS standard for EES can be implemented with some modification to the proposed architecture. The relevant theory of multi-dimensional TCM codes can be found in [2,3].

To support the high data rates of EES, the receiver must be carefully designed. In particular, the decoder architecture must include sufficient parallelism and pipelining, and the FPGA must offer the required computational and memory resources. The decoder architecture presented in this paper caters for this requirement. To implement it we use Altera's Stratix II EP2S60 FPGA, which meets the client's requirements, although similar high performance devices are available from other vendors.

The remainder of this paper is organized as follows. Section 2 gives an overview of the proposed codec architecture and Section 3 describes a three-RAM based traceback unit. Section 4 summarizes the hardware required to implement the proposed architecture, and Section 5 presents simulation and practical results. Finally, we conclude in Section 6.

* Corresponding author. E-mail addresses: [email protected] (A.T.M.A. Rahman), [email protected] (W.G. Cowley).

1051-2004/$ – see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.dsp.2009.06.003

Fig. 1. Encoder and mapper of 4D 8PSK TCM with 2.75 Bits/Sym.

2. Overview of the architecture

A block diagram of the proposed decoder architecture is given in Fig. 2. The architecture has six main parts: the Euclidean Distance Calculator (EDC), the 4D-metric calculator, the Symbol Block Synchronizer (SBS), the Add-Compare-Select (ACS) block, the Survivor Management Unit (SMU) and the Parallel Transition Decoder (PTD). In implementing the PTD, we have adopted the PTDT [2] technique, which has saved a significant amount of hardware. For the 4D-metric calculator we have developed a new technique which saves a moderate amount of hardware. In the following, we give short descriptions of the proposed 4D-metric calculator and of the ACS unit.

The 4D-metric calculator receives its inputs from the EDC and sends its output to the PTD, which calculates the Branch Metrics (BM) used by the ACS block. After receiving Euclidean distances from the previous block, the 4D-metric calculator divides them into cosets and finds the coset survivors. Once sixteen coset survivors are available (four from each 8PSK constellation, with four constellations in a 4D 8PSK TCM scheme), the 4D metrics can be calculated in two different ways: (i) the direct method and (ii) the proposed method.

In the direct method, all 256 combinations of 4D metrics are formed directly (taking four coset survivors at a time, each from a separate constellation, gives $4^4 = 256$ combinations). Here, a 4D metric means the sum of four coset survivors from four different constellations. Each combination requires three adders, hence a total of 768 adders for all the combinations. The direct method is straightforward, but it uses 768 adders where the proposed method uses 288, so the proposed method saves about 62% of the adders.

In the proposed method, we divide the four constellations into two groups and then obtain 32 2D metrics (sixteen 2D metrics from each group). By a 2D metric, we mean the sum of two coset survivors from two different constellations.
Calculating the 2D metrics requires a total of 32 adders, one adder for each combination. Once these 32 2D metrics, sixteen from each group, are available, we reuse them to find the 256 4D metrics: each 4D metric is the sum of one 2D metric from each group. Finding the 4D metrics from the 2D metrics therefore requires 256 adders for 256 combinations. The total number of adders is 288 (32 for the 2D metrics and 256 for deriving the 4D metrics from them), compared to 768 for the direct method. One disadvantage of our method is that it requires two pipelined stages compared to one for the direct method. This additional delay is not a problem for EES.

The ACS unit can be implemented in a serial or a parallel configuration [5]. The serial implementation requires fewer hardware resources than the parallel configuration but is much slower [5]. Considering the high throughput requirements of EES, we have implemented parallel ACS units. However, parallel ACS units alone cannot raise the data rate of this block to that of the other blocks of the Viterbi decoder, because of the feedback connections inside the ACS block. In [5–7] different techniques have been proposed to reduce the effect of the feedback connection on throughput. In [8–10], the ACS unit has been replaced with a Compare-Select-Add (CSA) unit. These enhanced techniques improve performance but are complex to implement and require a significant amount of additional hardware. Consequently, we have adopted unmodified ACS units with appropriate pipelining. Pipelining is not free of side effects: it requires some additional hardware and adds a few delay elements in the critical path. It is nevertheless easy to implement and requires less hardware than the methods of [5–10].
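Returning to the 4D-metric calculator, the adder counts quoted above (768 for the direct method, 288 for the two-stage method) can be checked with a short Python sketch. It is only a software illustration under stated assumptions: the function names are hypothetical, the pairing of constellations 1 and 2 into one group and 3 and 4 into the other is assumed (the text only says that the four constellations are divided into two groups), and each `+` stands for one two-input hardware adder.

```python
from itertools import product

def metrics_direct(cs):
    """Direct method: sum four coset survivors per combination.

    cs is a list of four lists (one per 8PSK constellation), each holding the
    four coset-survivor metrics of that constellation.
    Returns ({(i, j, k, l): 4D metric}, number of two-input adders).
    """
    out, adders = {}, 0
    for i, j, k, l in product(range(4), repeat=4):
        out[(i, j, k, l)] = cs[0][i] + cs[1][j] + cs[2][k] + cs[3][l]
        adders += 3  # three two-input adders per combination
    return out, adders

def metrics_two_stage(cs):
    """Two-stage method: 2D metrics first, then one adder per 4D metric."""
    adders = 0
    group1, group2 = {}, {}
    # Stage 1: sixteen 2D metrics per group (grouping 0+1 and 2+3 is assumed).
    for a, b in product(range(4), repeat=2):
        group1[(a, b)] = cs[0][a] + cs[1][b]
        group2[(a, b)] = cs[2][a] + cs[3][b]
        adders += 2
    # Stage 2: every 4D metric is one 2D metric from each group.
    out = {}
    for (i, j), (k, l) in product(group1, group2):
        out[(i, j, k, l)] = group1[(i, j)] + group2[(k, l)]
        adders += 1
    return out, adders
```

Both functions return the same 256 metrics for any input; only the adder count and the number of pipeline stages differ, which is the trade-off described in the text.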
As with the 4D-metric calculator, the latency added by pipelining the ACS units is not a problem for our target applications. Finally, it is important to mention that full precision BM inputs to the ACS units from the PTD would be 11 bits wide (corresponding to a 5-bit received signal). In our design we use only 5-bit BMs, and the corresponding path metric (PM) precision is 7 bits. These quantizations of BM and PM have a negligible impact on the performance of the decoder. However, if the quantization levels are reduced further, the performance of the decoder degrades. The impact of different BM and received signal quantization levels is shown in Section 5. It is also important to mention that, of the 11 bits of incoming information (in the case of 2.75 Bits/symbol for the CCSDS standard), only 3 bits are sent through the ACS units. The remaining 8 bits, which are not checked by the convolutional encoder, are sent through a chain of delay elements consisting of FIFOs.
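The add-compare-select operation with the bit widths mentioned above can be illustrated with the following Python sketch. Several details are assumptions and not taken from the paper: the trellis connectivity in `predecessor` is a simple shift-register stand-in (not the CCSDS encoder), the names are hypothetical, and the subtract-minimum-and-saturate step is just one common way to keep path metrics within 7 bits. The paper does not state how its ACS units bound the path metrics. In hardware the 64 states are processed in parallel; the loop here is sequential.

```python
NUM_STATES = 64               # 64-state encoder (from the paper)
RADIX = 8                     # 3 information bits per step -> 8 branches per state
PM_BITS = 7                   # path metric precision (from the paper)
PM_MAX = (1 << PM_BITS) - 1

def predecessor(state, idx):
    """Assumed connectivity: idx supplies the top three bits of the previous state."""
    return (state >> 3) | (idx << 3)

def acs_step(path_metrics, branch_metrics):
    """One trellis step of add-compare-select.

    path_metrics: list of 64 non-negative integers.
    branch_metrics[state][idx]: metric of the branch entering `state`
    from predecessor(state, idx).
    Returns (new_path_metrics, decisions); decisions holds one 3-bit
    survivor branch index per state (64 x 3 = 192 bits per step).
    """
    new_pm, decisions = [], []
    for state in range(NUM_STATES):
        # Add: extend every incoming path by its branch metric.
        candidates = [path_metrics[predecessor(state, idx)] + branch_metrics[state][idx]
                      for idx in range(RADIX)]
        # Compare and select: keep the smallest candidate and remember its index.
        best = min(range(RADIX), key=candidates.__getitem__)
        new_pm.append(candidates[best])
        decisions.append(best)
    # Assumed normalisation: subtract the minimum, then saturate to 7 bits.
    floor = min(new_pm)
    new_pm = [min(pm - floor, PM_MAX) for pm in new_pm]
    return new_pm, decisions
```

The list of 64 decisions returned per step corresponds to the 192-bit word that the survivor memory stores for later traceback.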

Fig. 2. Viterbi decoder.

Using FIFOs rather than multiplexers (448 8-bit multiplexers) for this delay chain increases memory usage but saves a large number of logic elements. The contents of these FIFOs are used by the traceback unit for decoding.

3. Three RAM based SMU

Two techniques for implementing the SMU are available in the literature [11]: the register exchange method and the memory traceback method. The register exchange method uses an $S \times L$ two-dimensional array of registers that resembles the trellis diagram of the relevant convolutional encoder. Here, $S$ denotes the number of states in the convolutional encoder and $L$ the trellis truncation length. Note that the trellis truncation length depends on the encoder constraint length and code rate [11]. As the code rate increases, the truncation length also increases, and a longer truncation length means more hardware. Likewise, a higher constraint length, and hence a larger number of states in the convolutional encoder, requires more hardware. As a result, for high truncation length and high constraint length the register exchange method becomes increasingly costly in terms of resources and power consumption [10]. In our case the code rate is 3/4, the truncation length is higher than the usual $5K$ (with $K$ the constraint length) and the convolutional encoder has sixty four states, so the register exchange method is not suitable. The only remaining option is the RAM based traceback method, which is inherently low power, low throughput and high latency [10]. In this method, survivor branch indexes (info bits) from the ACS units are stored in the RAM, and the data is later read from the RAM in a certain order to trace back and decode.
Although the RAM based traceback method is considered slower than the register exchange method [10], this is not a problem in our case, because in multi-dimensional Viterbi decoding the overall architecture runs only at the rate $R_s / L$, where $R_s$ is the channel symbol rate and $L$ is here the dimension (not the truncation length), which in our case is 4. Consequently, even for a very high symbol rate of 500 MSym/s, the RAM based traceback method needs to run at only 125 MHz, which is not a problem in our target device and provides a data rate of 1375 MBit/s (125 × 4 × 2.75). Likewise, high latency (of the order of microseconds) is not a problem for EES. In the next few paragraphs we describe the proposed three RAM bank based traceback method in detail. This technique may be used in 1D or multi-dimensional Viterbi decoders.

In this approach, three RAMs are used to implement the traceback functions, as shown in Fig. 3. RAM #1 does the primary traceback, RAM #2 performs the decoding, and RAM #3 is used to deliver the data in true time order. Each RAM has four pages, all of the same size. The size of a page (its number of address locations) determines the traceback depth (in our case the page size is 64), and reading and writing in the RAMs happen in cyclic order (page #0 follows page #3). Incoming data of 192 bits (64 states, each with a 3-bit survivor branch index) from the ACS block is written to RAM #1. Due to the nature of the traceback, writing and reading in RAM #1 are done in opposite directions and page by page. The write operation starts from the top of a page while the read operation starts from the bottom, and reading from RAM #1 is always at least one page behind writing.
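The division of labour between the three RAMs can be sketched in Python as follows. This is a simplified software model under stated assumptions, not the hardware design: `prev_state` is a hypothetical shift-register stand-in for the convolutional re-encoder (the real one follows the CCSDS encoder), the function names are illustrative, and the cyclic four-page addressing and read/write timing are collapsed into plain list slicing. What it keeps from the text is the page size of 64, the 64 survivor branch indexes per stored word, the primary traceback from state zero to a page boundary, the decoding of the preceding page from the survived state, and the final time reversal.

```python
NUM_STATES = 64   # states, so 64 survivor branch indexes per stored word
INDEX_BITS = 3    # bits per survivor branch index
PAGE = 64         # page size, which sets the traceback depth

def prev_state(state, idx):
    """Assumed re-encoder: given a state and a survivor index, return the previous state."""
    return (state >> INDEX_BITS) | (idx << INDEX_BITS)

def traceback_page(page_words, start_state):
    """Walk one page from its newest word to its oldest.

    Returns (state reached at the page boundary, indexes in time-reversed order).
    """
    state, reversed_idx = start_state, []
    for word in reversed(page_words):
        idx = word[state]
        reversed_idx.append(idx)
        state = prev_state(state, idx)
    return state, reversed_idx

def decode(words):
    """words: decision words, oldest first; length is a multiple of PAGE."""
    pages = [words[i:i + PAGE] for i in range(0, len(words), PAGE)]
    out = []
    for n in range(1, len(pages)):
        # Role of RAM #1: trace the newest page back from state zero, keep only the state.
        survived, _ = traceback_page(pages[n], 0)
        # Role of RAM #2: decode the preceding page starting from the survived state.
        _, reversed_idx = traceback_page(pages[n - 1], survived)
        # Role of RAM #3: restore true time order.
        out.extend(reversed(reversed_idx))
    return out
```

With `k` pages of decisions written, `decode` returns the survivor indexes of the first `k - 1` pages, since the newest page is used only to find a reliable starting state.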
Starting from any state (in our case state zero) at the beginning of a page, RAM #1 traces the path back with the help of MUX #1, MUX #2 and convolutional re-encoder #1 (given a current state and a survivor branch index, the convolutional re-encoder provides the previous state) until the boundary of the page is reached, and then transfers the survived state to RAM #2. After transferring the relevant information to RAM #2, the read and write operations move to new pages. For example, after hitting the boundary of page #2, the read operation moves from the top end of page #2 to the bottom end of page #3, while the write operation moves to the top end of page #0. Similarly, when the read operation crosses the boundary of page #3, it moves to the bottom of page #0 and the write operation goes to the beginning of page #1, and so on.

Fig. 3. Traceback unit architecture.

After receiving a survived state from RAM #1, RAM #2 starts decoding from where RAM #1 stopped its traceback and, through MUX #3, MUX #4 and convolutional re-encoder #2, provides the decoded data as survivor branch indexes in time reversed order. Since RAM #1 and RAM #2 operate in synchronism, and the data from RAM #1 is written to RAM #2 in the same order and in the same page as it is retrieved, decoding from the place where RAM #1 stopped its traceback is not a problem. Extending the last example to RAM #2, decoding needs to start from page #1 after the read operation in RAM #1 reaches the boundary of page #2. As expected, due to the synchronism between RAM #1 and RAM #2, the read operation in RAM #2 moves to the beginning of page #1 when the corresponding read operation in RAM #1 goes to the beginning of page #3, ensuring proper decoding of the transmitted data. Similarly, when the read operation in RAM #1 moves to page #0, the corresponding read operation in RAM #2 moves to page #2. Since RAM #2 provides data in time reversed order, RAM #3 is used to undo this effect, which is achieved by writing and reading in opposite directions in RAM #3. Finally, the output of the third RAM and other relevant information are used to select the appropriate FIFOs from the delay chain described above, which provide the final decoded data.

4. Required hardware resources

Table 1 shows the hardware resources required to implement the proposed design for CCSDS 4D 8PSK TCM with 2.75 Bits/symbol. Implementing the other options of the CCSDS standard for EES will require fewer hardware resources than the current implementation. The values in the fourth column of Table 1 are rounded to the nearest percent.

Table 1

| Items | Stratix II EP2S60 | Used resources | Percent of total |
| ALUTs | 48,352 | 25,044 | 51 |
| Equivalent logic elements | 60,440 | 30,824 | 51 |
| DSP blocks | 288 | 0 | 0 |
| Memory bits | 2,544,192 | 315,136 | 12 |

5. Measured performance and simulation results

In Fig. 4(a), we present simulation results showing the impact of BM and PM quantization for different received signal quantizations, while in Fig. 4(b) we show the impact of received signal quantization for fixed BM and PM quantization.

Fig. 4. Impact of quantization in 4D 8PSK TCM decoder with 2.75 Bits/Sym: (a) BM quantization; (b) received signal quantization.

Fig. 5. Performance of real time 4D 8PSK TCM decoder with 2.75 Bits/Sym.

From Fig. 4(b), it is clear that the unquantized received signal and quantization down to 5 bits perform almost the same, whereas 4-bit received signal quantization incurs a loss of around 0.4 dB at the target BER of $10^{-4}$. In Fig. 4(a), it can be seen that almost no loss is incurred in going from unquantized BM to 5-bit BM for a 5-bit received signal, but further quantization of BM causes a performance loss. Considering the overall impact of received signal and BM quantization, our design uses 5-bit received signal, 5-bit BM and 7-bit PM quantization, which gives almost unquantized performance.

Fig. 5 shows the performance obtained from hardware. This result was obtained in real time with digitally synthesized Gaussian noise. From this graph, we can see that the performance of the hardware for 4D 8PSK TCM with 2.75 Bits/symbol is slightly better than the CCSDS published result [4] in the high SNR region and almost the same in the low SNR region. This is probably due to the 4-bit received signal quantization used by the CCSDS, compared to our 5-bit quantization, and the absence of the tail effects of the digitally synthesized noise source. Fig. 5 also shows that at the target BER of $10^{-5}$, 4D 8PSK TCM provides a performance improvement of around 1 dB compared to uncoded QPSK.
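For orientation, the uncoded QPSK baseline can be reproduced from the standard textbook expression for Gray-coded QPSK with coherent detection over an AWGN channel, $P_b = \tfrac{1}{2}\,\mathrm{erfc}\!\left(\sqrt{E_b/N_0}\right)$. The Python sketch below evaluates it and searches for the $E_b/N_0$ that gives a target BER. The function names are hypothetical, and whether Fig. 5 is plotted against $E_b/N_0$ is an assumption, so the numbers are a reference point and not a reproduction of the measured curves.

```python
import math

def qpsk_ber(ebn0_db):
    """Theoretical bit error rate of uncoded Gray-coded QPSK over AWGN."""
    ebn0 = 10 ** (ebn0_db / 10)          # dB to linear ratio
    return 0.5 * math.erfc(math.sqrt(ebn0))

def required_ebn0_db(target_ber, lo=0.0, hi=20.0, iterations=60):
    """Bisection search for the Eb/N0 (in dB) at which qpsk_ber equals target_ber."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if qpsk_ber(mid) > target_ber:
            lo = mid                     # still too many errors, need more SNR
        else:
            hi = mid
    return (lo + hi) / 2
```

Calling `required_ebn0_db(1e-5)` gives roughly 9.6 dB, the usual reference value for uncoded QPSK at that error rate.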

6. Conclusion

In this paper we have presented an efficient architecture for the 4D 8PSK TCM schemes. This architecture can provide a data rate of 1672 MBit/s in a current high performance FPGA. With some modification in two blocks, this architecture can be used for all four options available in the CCSDS standard for EES. This architecture can also be used for a single dimensional Viterbi decoder (such as rate 1/2 with polynomials G1 = 171 and G2 = 133) provided that the parallel transition decoding block is removed. Similarly, the new RAM based SMU can be used for any Viterbi decoder.

Acknowledgments

This project was supported by the Australian Research Council (project LP0455736) and Satellite Services BV of the Netherlands. The authors thank colleagues at ITR and SSBV for their suggestions plus the anonymous reviewers.

References

[1] CCSDS standard on “Radio Frequency and Modulation System – Part 1 Earth Stations and Spacecraft”, Blue Book, September 2005.
[2] G. Ungerboeck, Channel coding with multilevel/phase signals, IEEE Trans. Inform. Theory 28 (January 1982) 55–67.
[3] S. Pietrobon, R.H. Deng, A. Lafanehere, G. Ungerboeck, D.J. Costello Jr., Trellis-coded multidimensional phase modulation, IEEE Trans. Inform. Theory 36 (January 1990) 63–89.
[4] CCSDS documents on “Bandwidth-Efficient Modulations, Summary of Definitions, Implementation”, CCSDS 413.0-G-1, Green Book, December 2003.
[5] G. Fettweis, H. Meyr, Parallel Viterbi algorithm implementation: Breaking the ACS-bottleneck, IEEE Trans. Commun. 37 (August 1989) 785–790.
[6] P. Black, T. Meng, A 140 Mb/s 32-state radix-4 Viterbi decoder, IEEE J. Solid-State Circuits 27 (December 1992) 1877–1885.
[7] K.K. Parhi, An improved pipelined MSB-first add-compare select unit structure for Viterbi decoders, IEEE Trans. Circuits Syst. I Regul. Pap. 51 (March 2004).
[8] G. Fettweis, R. Karabed, P.H. Siegel, H.K. Thapar, Reduced-complexity Viterbi detector architectures for partial response signalling, in: Proc. IEEE Global Telecommunications Conference, November 1995, pp. 559–563.
[9] I. Lee, J.L. Sonntag, A new architecture for the fast Viterbi algorithm, in: Proc. IEEE Global Telecommunications Conference, November 2000, pp. 1664–1668.
[10] E. Yeo, S.A. Augsburger, W.R. Davis, B. Nikolic, A 500-Mb/s soft-output Viterbi decoder, IEEE J. Solid-State Circuits 38 (July 2003).
[11] G. Feygin, P. Gulak, Architectural tradeoffs for survivor sequence memory management in Viterbi decoders, IEEE Trans. Commun. 41 (March 1993) 425–429.

A.T.M. Anishur Rahman received his B.Sc. in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh in 1998. After working for several years for the Bangladesh government, he completed his M.Eng. at the University of South Australia, Adelaide, Australia in 2007. Since mid-2007 he has been pursuing his Ph.D. in nanotechnology at the same university. His research interests include radiation shielding using nanoparticles and the general area of nanotechnology.

Bill Cowley received his B.Sc. and B.E. degrees from the University of Adelaide in 1974 and 1975. After working initially for the Post Master General's Department, he joined the Defence Science and Technology Organisation. In 1985 he completed his Ph.D. with Adelaide University and moved to the Digital Communications Group at the South Australian Institute of Technology. This organisation later became the Institute for Telecommunications Research (ITR) at the University of South Australia. During the last 20 years Bill Cowley has worked mainly in modem signal processing. He has recently been leader of the Satellite Communications Program in the Australian Cooperative Research Centre for Satellite Systems (CRCSS). He is currently Professor of Communications Signal Processing in ITR. His technical interests include modem synchronisation, iterative decoding, satellite communications, free space optical communications and speech processing.