Parallel access by butterfly networks for any degree permutation polynomial and ARP interleavers

Available online at www.sciencedirect.com. Journal of the Franklin Institute 356 (2019) 3139–3168. www.elsevier.com/locate/jfranklin



Lucian Trifina∗, Daniela Tarniceriu

Department of Telecommunications and Information Technologies, “Gheorghe Asachi” Technical University, Faculty of Electronics, Telecommunications and Information Technology, Bd. Carol I, no. 11A, Iasi 700506, Romania

Received 31 October 2017; received in revised form 12 August 2018; accepted 6 December 2018. Available online 30 January 2019.

Abstract

Using more processors for parallel turbo decoding is an important way to speed up the processing at the receiver of a communication system. Butterfly networks used to map the addresses of extrinsic values represent an elegant and simple solution in parallel turbo decoding. Recently, it has been shown that quadratic permutation polynomial (QPP) interleavers allow an easy way to compute the control bits for a butterfly network. In this paper we show that not only QPP interleavers, but also permutation polynomial (PP) interleavers of any degree and almost regular permutation (ARP) interleavers allow the same easy way to compute the control bits required in butterfly networks. As a consequence, butterfly networks are useful in parallel turbo decoding when these high-performance algebraic interleavers are used.

© 2019 The Franklin Institute. Published by Elsevier Ltd. All rights reserved.

∗ Corresponding author. E-mail addresses: [email protected] (L. Trifina), [email protected] (D. Tarniceriu).

https://doi.org/10.1016/j.jfranklin.2018.12.018
0016-0032/© 2019 The Franklin Institute. Published by Elsevier Ltd. All rights reserved.

1. Introduction

Nowadays, communication systems with high processing speed at the receiver side are a major requirement. Turbo codes have gained much interest in the area of error correcting codes because of their very good performance on noisy communication channels. To speed up turbo decoding, several processors are used in parallel. For a parallel turbo decoding implementation the extrinsic values are stored in several memory banks. We denote by L the number of processors used. Because of the interleaver used in turbo codes, this approach leads to possible collisions when accessing the corresponding L memory banks, since the extrinsic values required in turbo decoding are read and written in two different orders. Different interleavers have been designed to resolve the problem of collisions, but this method imposes constraints on the overall turbo code design.

In [1] the problem of avoiding collisions in accessing the memory was solved for an arbitrary interleaver by a suitable mapping of the variables read from or written to the memory. In [2] the same problem was solved for L a power of two and for an arbitrary interleaver by using butterfly networks to map the addresses of extrinsic values. Butterfly networks offer a simpler and less complex solution for parallel access to the memory. However, for an arbitrary interleaver, determining the control bits required for routing as in [2] is rather cumbersome. Therefore, in [3], the author has particularized the determination of the control bits in a butterfly network for quadratic permutation polynomial (QPP) interleavers [4], obtaining an easy way to determine them for these particular interleavers. The control bits can be computed on-the-fly for QPP interleavers, and thus a separate initial processing step and lookup tables are not needed as in [2]. This allows parallel turbo decoding with lower complexity and high speed. These are the main advantages of using butterfly networks in parallel decoding of turbo codes with QPP interleavers. When implementing a parallel turbo decoder for turbo codes with QPP interleavers in an application specific integrated circuit (ASIC), the main target is to obtain high throughput with a chip area as small as possible and low power consumption.
For a given number of processors, the main challenge in parallel turbo decoder design is to find an efficient solution for routing the extrinsic values computed by the component soft-input soft-output (SISO) maximum a-posteriori probability (MAP) decoders. The two main parts used in the implementation of a QPP interleaver for a parallel turbo decoder are the circuit that generates the physical address where the extrinsic values are stored in or read from the memory, and the interconnection network. The interconnection network deals with the appropriate routing of the extrinsic values. To our knowledge, the best-known interconnection networks proposed in the literature are the crossbar network [5], the master-slave Batcher network [6], the Benes network [7], and the barrel shifter network [8]. In Section 5.1 we will show that when routing the same number of extrinsic values (equal to a power of two) with butterfly networks, the number of 2-input multiplexers and the number of full adders required for hardware implementation is smaller than or equal to the number required for the previous solutions. We note that in [7] it is mentioned that butterfly networks can be used with QPP interleavers. However, the whole turbo codeword block is assumed to be processed at the decoder on different equally-sized subblocks, for which the maximum contention-free property is proved in [9]. The solution proved in [3] for QPP interleavers, and in this paper for any degree PP interleavers and ARP interleavers with some constraints, is more general and offers more flexibility. As stated in [3], when using several pipelined units to compute the state metrics it is possible to process the whole trellis continuously, while with the proof from Takeshita [9] the processing has to be performed over disjoint subblocks, and thus metric initializations must be done for each subblock.
An efficient and general solution for the implementation of a QPP interleaver for a parallel turbo decoder is given in [10]. It is shown that the proposed solution for the memory reading circuit is more efficient in terms of the number of 2-input multiplexers and the number of full adders compared to the previously known solutions. Actually, the solution proposed in [10] uses a butterfly network based structure matched to four types of parallelization of turbo decoders. These are: serial MAP (SMAP) or cross MAP (XMAP) strategies to compute the state metrics; the symbol-based radix-$2^{\nu}$ algorithm, which computes the state metrics by merging ν trellis sections; the pipeline decomposition of the recursions computing the forward and backward metrics required in MAP decoders; and, finally, the classical use of several processors that work as MAP decoders over subblocks of the original turbo codeword block. This solution uses several blocks of butterfly networks, appropriately arranged so that the complexity is reduced. When the parallelization is made only by several MAP decoders that work on disjoint subblocks as in [9], the interconnection network reduces to one butterfly network. Since in the present paper we analyze the possibility of using only one butterfly network for all extrinsic values, it is not fair to compare the implementation complexity with that from Wang et al. [10]. The possibility of using any degree PP and ARP interleavers with the solution from Wang et al. [10] is left for future work.

1.1. Motivation for our work

QPP interleavers have been intensively studied in the last fifteen years. The topic of PP interleavers of degrees higher than two has gained interest in recent years (see [9,11–20] for some results). Although QPP interleavers lead to very good performance when they are carefully chosen, PP interleavers of degree higher than two can outperform QPP interleavers. Firstly, in [13] the author provides a PP interleaver of degree five and length 512 with performance similar to that of a dithered relative prime (DRP) interleaver [31], which has the best known performance. Then, some better cubic PP (CPP) interleavers of short to medium lengths are provided in [15,16,19]. Recently, in [21] a partial upper bound for any degree PP interleavers has been established.
In [20], PP interleavers of degree up to five with optimum minimum distance were found, and some PP interleavers which reach the partial upper bound in [21] were identified. Another class of interleavers with very good performance and simple implementation is that of the almost regular permutation (ARP) interleavers [22]. ARP interleavers are used in the Digital Video Broadcasting Return Channel via Satellite (DVB-RCS) [23] and WiMax [24] standards. ARP interleavers were also proposed as an alternative to QPP interleavers in the LTE standard [25]. In [3], the use of butterfly networks with on-the-fly determination of control bits is proved only for QPP interleavers. Our concern in this paper is to show that any degree PP interleavers and ARP interleavers with some constraints allow the same on-the-fly determination of control bits for routing the extrinsic values with butterfly networks in parallel turbo decoding. As a consequence, the advantages of using butterfly networks with on-the-fly determination of control bits hold for these interleavers.

The 5G wireless network is proposed to be deployed in 2020. Turbo codes used in the previous 3G and 4G networks are under study to be replaced with two other channel code classes which approach the channel capacity [26,27], namely low density parity check (LDPC) codes [28] and polar codes [29]. The reason for this change would be that turbo codes cannot achieve sufficiently high throughputs and that they have higher complexity than LDPC and polar codes. However, in [30], these two statements have been refuted. AccelerComm has demonstrated that turbo codes can achieve decoded throughputs exceeding the 5G target of 20 Gbps. In [30] it was shown that the overall implementation complexity of a channel code depends not only on its computation complexity, but also on its interconnect complexity and its inherent flexibility. Of the three near-optimal channel code classes, turbo codes have the highest degree of flexibility at the lowest complexity. These features are due to the regular and flexible structure of turbo codes, in contrast to the structures of LDPC and polar codes. Additionally, turbo codes can offer the advantage of backwards compatibility with 3G and 4G. The interconnect complexity of turbo codes can be the lowest with butterfly networks. Using any degree PP and ARP interleavers in parallel turbo decoding with butterfly networks, we can achieve good error correction, high flexibility, low latency and low complexity, such as those required in 5G communication systems. In addition, these interleavers can be designed without other constraints (except their length) when parallel turbo decoding is used. The issues presented above have motivated our work.

1.2. Contributions

The main contributions of this paper are:

1) we show that not only QPP interleavers, but permutation polynomial (PP) interleavers of any degree allow the same easy way of computing the control bits required in butterfly networks as in [3].
2) we prove some properties of ARP parameters and give a way to construct ARP interleavers with $R' = k_R \cdot R$ component LPPs from an ARP with R LPPs, where $R'$, $k_R$, and R are positive integers.
3) we show that ARP interleavers consisting of R component LPPs, with R a power of two, allow the same easy way to compute the control bits required in butterfly networks as in [3] when the number of processors used in parallel turbo decoding is a power of two that divides the interleaver length and is greater than or equal to R. If the ARP parameters fulfill some additional constraints, then the number of processors used can be smaller than R.
4) we make a theoretical comparison, from the point of view of implementation complexity, of butterfly networks and other previously known interconnection networks when using any degree PP interleavers and ARP interleavers with some constraints.
The implementation complexity is assessed in terms of the number of 2-input multiplexers and the number of full adders required to implement the interconnection network and to generate the interleaved addresses. We show that the implementation complexity of butterfly networks with the interleavers analyzed in this paper is lower than or equal to that of the previously known interconnection networks. Additionally, butterfly networks with the possibilities of storing the extrinsic values proved in the paper offer more flexibility than the classical solution with disjoint equally-sized subblocks. Recently, in [32], it has been shown that ARP interleavers represent a more general model, because DRP and QPP interleavers can be described by the ARP model. The representation of QPP interleavers by the ARP model was extended to cubic permutation polynomial (CPP) interleavers in [19]. Thus, all these high-performance interleavers can be used with the same ease when parallel turbo decoding uses butterfly networks.

1.3. Structure of the paper

The paper is structured as follows. Section 2.2 presents background on parallel access by butterfly networks. In Sections 3 and 4 it is proved that any degree PP and ARP interleavers, respectively, can be used with butterfly networks with the same “on the fly” computation of control bits as was proved for QPP interleavers in [3]. In Section 4, the condition on the parameters of an ARP interleaver is also proved and two ways to choose the parameters of these interleavers are provided. In Section 5.1 we make a theoretical comparison of the best-known interconnection networks from the point of view of implementation complexity. In Section 5.2 we give two examples for a PP interleaver of degree five and two examples for an ARP interleaver, showing for each of them the physical interleaved addresses and the control bits when using these interleavers with butterfly networks. Finally, Section 6 concludes the paper.

2. Preliminaries

2.1. Notations

In the paper, the following notations will be used:

• $\mathbb{N}$ is the set of natural numbers;
• $\mathbb{N}^*$ is the set of natural numbers greater than zero;
• $W_N$, with $N \in \mathbb{N}^*$, is the set $\{0, 1, \ldots, N-1\}$;
• $\mathcal{P} = \{2, 3, 5, \ldots\}$ is the set of prime numbers;
• $C_n^k$, with $n, k \in \mathbb{N}^*$ and $n \ge k$, is the binomial coefficient (i.e. n choose k);
• $p \mid N$, with $p, N \in \mathbb{N}$, stands for p divides N;
• $\gcd(a, b)$ stands for the greatest common divisor of non-negative integers a and b;
• (mod n), with $n \in \mathbb{N}^*$, stands for the modulo n operation;
• $\lceil x \rceil$, with x a real number, stands for the smallest integer greater than or equal to x;
• $\pi'(x)$ stands for the formal derivative of polynomial $\pi(x)$;
• $\forall$ stands for “for all”;
• in a proof the sign “⇐” stands for the beginning of the direct proof and the sign “⇒” stands for the beginning of the inverse proof.

For $N \in \mathbb{N}^*$ we consider the prime factorization of N as

$$N = \prod_{p \in \mathcal{P}} p^{n_{N,p}}, \qquad (1)$$

where $n_{N,p} \ge 1$ for a finite number of prime numbers p and $n_{N,p} = 0$ in all the other cases.

2.2. Parallel access by butterfly networks

Let the interleaver length be factorized as $N = L \cdot M$ for positive integers L and M. Let the functions $a_j(k): W_L \times W_M \to W_N$ be defined so that $a_j(k) \ne a_i(k)$, $\forall k \in W_M$, $\forall i, j \in W_L$ with $j \ne i$, and so that $\forall x \in W_N$ there is a unique function $a_{j_x}$ with $a_{j_x}(k_x) = x$. For a parallel turbo decoding implementation, the linear and interleaved accesses to the memory banks at time $k \in W_M$ are defined by the address vectors $(a_0(k), a_1(k), \ldots, a_{L-1}(k))$ and $(\pi(a_0(k)), \pi(a_1(k)), \ldots, \pi(a_{L-1}(k)))$, respectively. To avoid collisions at the same memory bank at a given moment, there must exist a function $F_L: W_N \to W_L$ so that

1) $F_L(a_i(k)) \ne F_L(a_j(k))$ (linear parallel access)
2) $F_L(\pi(a_i(k))) \ne F_L(\pi(a_j(k)))$ (interleaved parallel access)

$\forall i, j \in W_L$, $i \ne j$, and $k \in W_M$. If there exists such a function $F_L$ for an interleaver over $W_N$, then we say that the interleaver is contention-free for parameters L and M. In this case, a parallel turbo decoding implementation is possible with L processors and L memory banks, each of them with M memory cells. The linear parallel access for any degree PP interleaver with functions $a_j(k) = j \cdot M + k$ was proved in [9]. In [3], Theorem 1 proved that any QPP interleaver of length $N = 2^n \cdot M$, with n and M positive integers, is contention-free by the function

$$F_{2^n}(x) = x \pmod{2^n}, \qquad (2)$$

and that the function from Eq. (2) provides exactly the same mapping of addresses as a $2^n \times 2^n$ butterfly network does.

In Fig. 1 a 16 × 16-butterfly network and sixteen memories are shown. The 16 × 16-butterfly network consists of thirty-two 2-by-2 crossbar switches. Each 2-by-2 crossbar switch is controlled by one bit. If the bit is zero, a direct connection is performed, and if the bit is one, a cross connection is performed. In Fig. 1 the thirty-two control bits $x_0, x_1, \ldots, x_{31}$ are set to zero. Hence all 2-by-2 crossbar switches perform a direct connection. To get from an input pin i to an output pin j, four control bits are required [33]. If the binary representations of the decimal numbers i and j are $(i_3 i_2 i_1 i_0)_2$ and $(j_3 j_2 j_1 j_0)_2$, respectively, then the control bits $cb_k$, with k = 0, 1, 2, 3, are computed by $cb_k = i_k \oplus j_k$, where $\oplus$ denotes modulo-2 addition. The bit with index 0 in a binary representation is the least significant bit. The first control bit used is $cb_0$, then $cb_1$, then $cb_2$, and then $cb_3$, i.e. the four control bits are applied from the left to the right side of the butterfly network.

Fig. 1. 16 × 16-butterfly network.

The definition of a uniformly $2^n$-dyadic vector is used in Theorem 1 from [3]. We give a generalization of this definition below.

Definition 2.1. Let p be a prime number. A vector $(a_0(k), a_1(k), \ldots, a_{p^n-1}(k))$ of integer components $a_l \ge 0$, $l \in W_{p^n}$, is uniformly $p^n$-dyadic if

$$a_{p^q k + i} \ne a_{p^q k + j} \pmod{p^q}, \qquad (3)$$

$\forall i, j \in W_{p^q}$ with $i \ne j$, $\forall k \in W_{p^{n-q}}$, and $\forall q = 1, 2, \ldots, n$.

3. Parallel access by butterfly networks for any degree PP interleavers

3.1. Previous results on permutation polynomials

We begin this section with the definition of a PP. Then we give three theorems and a lemma useful for obtaining the results in this section.

Definition 3.1. The polynomial of degree d, modulo N,

$$\pi(x) = q_0 + q_1 x + q_2 x^2 + \cdots + q_d x^d \pmod{N}, \qquad (4)$$

where N is a positive integer, is a PP if the coefficients $q_k$, $k = 1, \ldots, d$, are chosen so that the set $\{\pi(0), \pi(1), \ldots, \pi(N-1)\}$, modulo N, is a permutation of the set $W_N$. The free term $q_0$ only determines a cyclic shift of the permutation elements. Thus, we may and we will assume that $q_0 = 0$.

Theorem 3.2 ([4,11]). For any $N = \prod_{p \in \mathcal{P},\, p \mid N} p^{n_{N,p}}$, $\pi(x)$ is a PP modulo N iff $\pi(x)$ is also a PP modulo $p^{n_{N,p}}$, for any p so that $n_{N,p} \ge 1$.

Theorem 3.3 ([4,11,34,35]). $\pi(x)$ is a PP modulo $p^n$, with n > 1, iff $\pi(x)$ is a PP modulo p and $\pi'(x) \ne 0 \pmod{p}$ for every integer x.

Theorem 3.3 is a direct consequence of Theorem 123 from [36], as mentioned in [35].

Theorem 3.4 ([37]). For $N = 2^n$, with $n \in \mathbb{N}$, n > 1, $\pi(x)$ in (4) is a PP iff $q_1 = 1 \pmod{2}$, $(q_2 + q_4 + q_6 + \cdots) = 0 \pmod{2}$ and $(q_3 + q_5 + q_7 + \cdots) = 0 \pmod{2}$.

Lemma 3.5 ([37]). $\pi(x)$ from Eq. (4) is a PP modulo 2 iff $(q_1 + q_2 + q_3 + \cdots + q_d) = 1 \pmod{2}$.
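The coefficient test of Theorem 3.4 can be compared with an exhaustive permutation check. The following is a minimal sketch, not part of the original paper: the function names are illustrative, and the coefficient list is assumed to hold $q_1, \ldots, q_d$ with $q_0 = 0$.

```python
from itertools import product


def poly_mod(coeffs, x, N):
    """Evaluate q1*x + q2*x^2 + ... + qd*x^d modulo N (q0 = 0); coeffs = [q1, ..., qd]."""
    return sum(q * pow(x, i, N) for i, q in enumerate(coeffs, start=1)) % N


def is_pp_bruteforce(coeffs, N):
    """True if x -> poly_mod(coeffs, x, N) permutes {0, ..., N-1}."""
    return sorted(poly_mod(coeffs, x, N) for x in range(N)) == list(range(N))


def is_pp_power_of_two(coeffs):
    """Coefficient test of Theorem 3.4, valid for N = 2^n with n > 1."""
    q1 = coeffs[0]
    even_index_sum = sum(coeffs[1::2])  # q2 + q4 + q6 + ...
    odd_index_sum = sum(coeffs[2::2])   # q3 + q5 + q7 + ...
    return q1 % 2 == 1 and even_index_sum % 2 == 0 and odd_index_sum % 2 == 0


def criterion_matches_bruteforce(n, degree, max_coeff):
    """Compare both tests for every coefficient vector with entries below max_coeff."""
    N = 2 ** n
    return all(
        is_pp_bruteforce(list(c), N) == is_pp_power_of_two(list(c))
        for c in product(range(max_coeff), repeat=degree)
    )
```

For instance, `criterion_matches_bruteforce(3, 3, 4)` compares the two tests for all 64 cubic coefficient vectors with entries in {0, 1, 2, 3} modulo 8.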

3.2. Proof for parallel access by butterfly networks for any degree PP interleavers

In this section we prove that any degree PP interleaver of length $N = 2^n \cdot M$ is contention-free by the function (2). Actually, we prove that Lemmas 3, 4 and 5 from [3] are fulfilled for a PP of any degree, not only for QPPs. First we prove a property of a PP of any degree.

Lemma 3.6. Consider a PP of degree d as in Eq. (4). Then, $\forall L \in W_N$ so that $L \mid N$, and $\forall x \in W_N$, we have $\pi(x) \pmod{L} = \pi(x \pmod{L}) \pmod{L}$.
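Lemma 3.6 is easy to check numerically. The sketch below is illustrative only: the helper names and the example quintic polynomial modulo 64 are assumptions, not taken from the paper (the coefficients were chosen to satisfy the conditions of Theorem 3.4).

```python
def poly_mod(coeffs, x, N):
    """Evaluate q1*x + q2*x^2 + ... + qd*x^d modulo N; coeffs = [q1, ..., qd]."""
    return sum(q * pow(x, i, N) for i, q in enumerate(coeffs, start=1)) % N


def lemma_3_6_holds(coeffs, N):
    """Check pi(x) mod L == pi(x mod L) mod L for every divisor L of N in W_N and every x in W_N."""
    divisors = [L for L in range(1, N) if N % L == 0]
    return all(
        poly_mod(coeffs, x, N) % L == poly_mod(coeffs, x % L, N) % L
        for L in divisors
        for x in range(N)
    )


# Hypothetical quintic PP modulo 64: q1 odd, q2 + q4 even, q3 + q5 even.
EXAMPLE_COEFFS = [1, 2, 2, 4, 2]
```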


Proof. Let x be expressed as

$$x = x_L + k \cdot L, \qquad (5)$$

so that $x_L \in W_L$ and $k \in \mathbb{N}$, i.e. $x_L = x \pmod{L}$. Then we can write the ith power of x from Eq. (5) modulo L, with $i \in W_d$, as

$$\begin{aligned}
x^i \pmod{L} &= (x_L + k \cdot L)^i \pmod{L} = \sum_{j=0}^{i} C_i^j \cdot (x_L)^j \cdot (k \cdot L)^{i-j} \pmod{L} \\
&= \sum_{j=0}^{i-1} C_i^j \cdot (x_L)^j \cdot (k \cdot L)^{i-j} \pmod{L} + (x_L)^i \pmod{L} = (x_L)^i \pmod{L}. \qquad (6)
\end{aligned}$$

The last equality is true since $C_i^j \in \mathbb{N}$ and $i - j > 0$, $\forall j = 0, 1, \ldots, i-1$, the indices in the sum from the second line of Eq. (6). With Eqs. (6) and (4) the lemma results immediately.

Now we restate Lemmas 3, 4, and 5 from [3] for a PP of any degree. In fact, in Lemmas 3.7 and 3.8 we prove more general results for any degree PP.

Lemma 3.7. Let n and M be positive integers and $N = p^n \cdot M$, where p is any prime number. Assume that π is a PP of arbitrary degree d on $W_N$, as in Eq. (4). Let x and y be in $W_M$. Then

$$x = y \pmod{p^n}, \qquad (7)$$

if and only if

$$\pi(x) = \pi(y) \pmod{p^n}. \qquad (8)$$

Proof. Let $n_{max}$ be the greatest positive integer so that $p^{n_{max}} \mid N$. From Theorem 3.2 we have that π is a PP of degree d on $W_{p^{n_{max}}}$. If $n_{max} > 1$, from Theorem 3.3 we have that π is a PP of degree d on $W_p$ and $\pi'(x) \ne 0 \pmod{p}$, $\forall x \in W_p$. Because π is a PP of degree d on $W_p$ and $\pi'(x) \ne 0 \pmod{p}$, $\forall x \in W_p$, from Theorem 3.3 it also results that π is a PP of degree d on $W_{p^q}$, $\forall q \in \mathbb{N}$ with $q \ge 2$. Taking into account Lemma 3.6, this means that Eqs. (7) and (8) imply each other, $\forall n \in \mathbb{N}^*$ so that $p^n \mid N$.
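As an illustration of Lemma 3.7 for p = 2, the sketch below checks the equivalence over all pairs of addresses in $W_N$ and then looks at the memory banks $\pi(a) \bmod L$ reached by L addresses with distinct residues modulo L. The function names and the example polynomial are assumptions made for this sketch only.

```python
def poly_mod(coeffs, x, N):
    """Evaluate q1*x + q2*x^2 + ... + qd*x^d modulo N; coeffs = [q1, ..., qd]."""
    return sum(q * pow(x, i, N) for i, q in enumerate(coeffs, start=1)) % N


def lemma_3_7_holds(coeffs, N, p_pow):
    """x = y (mod p^n) <=> pi(x) = pi(y) (mod p^n), checked over all pairs in W_N."""
    images = [poly_mod(coeffs, x, N) % p_pow for x in range(N)]
    return all(
        ((x - y) % p_pow == 0) == (images[x] == images[y])
        for x in range(N)
        for y in range(N)
    )


def interleaved_banks(coeffs, N, L, addresses):
    """Bank index pi(a) mod L for each address a, i.e. the mapping of Eq. (2) with L = 2^n."""
    return [poly_mod(coeffs, a, N) % L for a in addresses]


# Hypothetical quintic PP modulo 64 (satisfies the conditions of Theorem 3.4).
EXAMPLE_COEFFS = [1, 2, 2, 4, 2]
```

With these assumptions, any eight addresses that are pairwise distinct modulo 8 (for example 8k, 8k + 1, ..., 8k + 7) are sent to eight different banks after interleaving.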

For p = 2, Lemma 3.7 gives the same result as Lemma 3 from [3], but for a PP of any degree.

Lemma 3.8. Let n and M be positive integers and $N = p^n \cdot M$, where p is any prime number. Assume that π is a PP of arbitrary degree d on $W_N$, as in Eq. (4). Assume that $a_i$ are in $W_N$ for every i in $W_{p^n}$. Then $A_a = (a_0, a_1, \ldots, a_{p^n-1})$ is uniformly $p^n$-dyadic if and only if $A_\pi = (\pi(a_0), \pi(a_1), \ldots, \pi(a_{p^n-1}))$ is uniformly $p^n$-dyadic.

Proof. As with Lemma 4 from [3], Lemma 3.8 is a direct consequence of Lemma 3.7. We give the proof for completeness. If we detail Eq. (3) for any q = 1, 2, ..., n we have

$$a_{pk} \ne a_{pk+1} \ne \cdots \ne a_{pk+p-1} \pmod{p}, \quad \forall k \in W_{p^{n-1}}, \qquad (9)$$

$$a_{p^2 k} \ne a_{p^2 k+1} \ne \cdots \ne a_{p^2 k+p^2-1} \pmod{p^2}, \quad \forall k \in W_{p^{n-2}}, \qquad (10)$$

and so on,

$$a_{p^{n-1} k} \ne a_{p^{n-1} k+1} \ne \cdots \ne a_{p^{n-1} k+p^{n-1}-1} \pmod{p^{n-1}}, \quad \forall k \in W_p, \qquad (11)$$

$$a_0 \ne a_1 \ne \cdots \ne a_{p^n-1} \pmod{p^n}. \qquad (12)$$

Taking into account Lemma 3.7, Eqs. (9)–(12) hold if and only if

$$\pi(a_{pk}) \ne \pi(a_{pk+1}) \ne \cdots \ne \pi(a_{pk+p-1}) \pmod{p}, \quad \forall k \in W_{p^{n-1}}, \qquad (13)$$

$$\pi(a_{p^2 k}) \ne \pi(a_{p^2 k+1}) \ne \cdots \ne \pi(a_{p^2 k+p^2-1}) \pmod{p^2}, \quad \forall k \in W_{p^{n-2}}, \qquad (14)$$

and so on,

$$\pi(a_{p^{n-1} k}) \ne \pi(a_{p^{n-1} k+1}) \ne \cdots \ne \pi(a_{p^{n-1} k+p^{n-1}-1}) \pmod{p^{n-1}}, \quad \forall k \in W_p, \qquad (15)$$

$$\pi(a_0) \ne \pi(a_1) \ne \cdots \ne \pi(a_{p^n-1}) \pmod{p^n}. \qquad (16)$$

But Eqs. (13)–(16) mean that the vector $A_\pi$ is uniformly $p^n$-dyadic. Thus, the lemma is proved.

For p = 2, Lemma 3.8 gives the same result as Lemma 4 from [3], but for a PP of any degree. The next lemma allows an easy derivation of the control bits of a $2^n \times 2^n$ butterfly network for a PP of any degree.

Lemma 3.9. Let n and M be positive integers and $N = 2^n \cdot M$. Assume that π is a PP of arbitrary degree d on $W_N$, as in Eq. (4). Then for any $x \in W_N$, we have

$$\pi(x + k 2^{n-1}) = \pi(x) + \big(k \pmod{2}\big)\, 2^{n-1} \pmod{2^n}, \quad \forall k \in W_N. \qquad (17)$$

Proof. We have

$$\begin{aligned}
\pi(x + k 2^{n-1}) &= \sum_{i=1}^{d} q_i \cdot (x + k 2^{n-1})^i \pmod{2^n} \\
&= \sum_{i=1}^{d} q_i \cdot \sum_{j=0}^{i} C_i^j \cdot x^j \cdot (k 2^{n-1})^{i-j} \pmod{2^n} \\
&= \sum_{i=1}^{d} q_i \cdot \Big( x^i + \sum_{j=0}^{i-1} C_i^j \cdot x^j \cdot (k 2^{n-1})^{i-j} \Big) \pmod{2^n} \\
&= \pi(x) + \sum_{i=1}^{d} q_i \cdot \sum_{j=0}^{i-1} C_i^j \cdot x^j \cdot (k 2^{n-1})^{i-j} \pmod{2^n} \\
&= \pi(x) + q_1 \cdot k \cdot 2^{n-1} + \sum_{i=2}^{d} q_i \cdot \sum_{j=0}^{i-1} C_i^j \cdot x^j \cdot (k 2^{n-1})^{i-j} \pmod{2^n}. \qquad (18)
\end{aligned}$$

If $k = 0 \pmod{2}$, i.e. $k = 2 \cdot l$, with $l \in \mathbb{N}$, from Eq. (18) we have

$$\pi(x + k 2^{n-1}) = \pi(x) \pmod{2^n}, \qquad (19)$$


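i.e. Eq. (17) for $k = 0 \pmod{2}$. If $k = 1 \pmod{2}$, i.e. $k = 2 \cdot l + 1$, with $l \in \mathbb{N}$, from Eq. (18) we have

$$\pi(x + k 2^{n-1}) = \pi(x) + q_1 \cdot 2^{n-1} + \sum_{i=2}^{d} q_i \cdot \sum_{j=0}^{i-1} C_i^j \cdot x^j \cdot (2^{n-1})^{i-j} \pmod{2^n}. \qquad (20)$$

For $i - j \ge 2$ and $n \ge 2$ we have $(i - j)(n - 1) \ge n$, and thus $(2^{n-1})^{i-j} = 0 \pmod{2^n}$. Then, for $n \ge 2$, Eq. (20) is reduced to

$$\begin{aligned}
\pi(x + k 2^{n-1}) &= \pi(x) + q_1 \cdot 2^{n-1} + \sum_{i=2}^{d} q_i \cdot C_i^{i-1} \cdot x^{i-1} \cdot 2^{n-1} \pmod{2^n} \\
&= \pi(x) + q_1 \cdot 2^{n-1} + \sum_{i=2}^{d} q_i \cdot i \cdot x^{i-1} \cdot 2^{n-1} \pmod{2^n} \\
&= \pi(x) + 2^{n-1} \cdot \Big( q_1 + \sum_{i=2}^{d} q_i \cdot i \cdot x^{i-1} \Big) \pmod{2^n}. \qquad (21)
\end{aligned}$$

For $x = 0 \pmod{2}$, i.e. $x = 2l$, Eq. (21) becomes

$$\begin{aligned}
\pi(x + k 2^{n-1}) &= \pi(x) + 2^{n-1} \cdot \Big( q_1 + \sum_{i=2}^{d} q_i \cdot i \cdot (2l)^{i-1} \Big) \pmod{2^n} \\
&= \pi(x) + 2^{n-1} \cdot q_1 \pmod{2^n} = \pi(x) + 2^{n-1} \pmod{2^n}, \qquad (22)
\end{aligned}$$

i.e. Eq. (17) for $k = 1 \pmod{2}$. The last equality in Eq. (22) is true because π is a PP on $W_{2^n}$, and thus, considering Theorem 3.4, $q_1 = 1 \pmod{2}$. For $x = 1 \pmod{2}$, i.e. $x = 2l + 1$, Eq. (21) becomes

$$\begin{aligned}
\pi(x + k 2^{n-1}) &= \pi(x) + 2^{n-1} \cdot \Big( q_1 + \sum_{i=2}^{d} q_i \cdot i \cdot (2l + 1)^{i-1} \Big) \pmod{2^n} \\
&= \pi(x) + 2^{n-1} \cdot \Big( q_1 + \sum_{i=2}^{d} q_i \cdot i \Big) \pmod{2^n} \\
&= \pi(x) + 2^{n-1} \cdot (q_1 + 2 q_2 + 3 q_3 + \cdots + d q_d) \pmod{2^n} \\
&= \pi(x) + 2^{n-1} \cdot (q_1 + 3 q_3 + 5 q_5 + 7 q_7 + \cdots) + \underbrace{2^{n-1} \cdot (2 q_2 + 4 q_4 + 6 q_6 + \cdots)}_{= 0 \pmod{2^n}} \pmod{2^n} \\
&= \pi(x) + 2^{n-1} \cdot (q_1 + q_3 + q_5 + q_7 + \cdots) + \underbrace{2^{n-1} \cdot (2 q_3 + 4 q_5 + 6 q_7 + \cdots)}_{= 0 \pmod{2^n}} \pmod{2^n} \\
&= \pi(x) + 2^{n-1} \cdot \underbrace{q_1}_{= 1 \pmod{2}} + 2^{n-1} \cdot \underbrace{(q_3 + q_5 + q_7 + \cdots)}_{= 0 \pmod{2}} \pmod{2^n} \\
&= \pi(x) + 2^{n-1} \pmod{2^n}, \qquad (23)
\end{aligned}$$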


i.e. Eq. (17) for $k = 1 \pmod{2}$. The last equality in Eq. (23) is true because π is a PP on $W_{2^n}$, and thus $q_1 = 1 \pmod{2}$ and $q_3 + q_5 + q_7 + \cdots = 0 \pmod{2}$ from Theorem 3.4. Eqs. (21)–(23) are valid for $n \ge 2$. For n = 1, Eq. (20) becomes

$$\begin{aligned}
\pi(x + k) &= \pi(x) + q_1 + \sum_{i=2}^{d} q_i \cdot \sum_{j=0}^{i-1} C_i^j \cdot x^j \pmod{2} \\
&= \pi(x) + q_1 + \sum_{i=2}^{d} q_i + \sum_{i=2}^{d} q_i \cdot \sum_{j=1}^{i-1} C_i^j \cdot x^j \pmod{2} \\
&= \pi(x) + \sum_{i=1}^{d} q_i + \sum_{i=2}^{d} q_i \cdot \sum_{j=1}^{i-1} C_i^j \cdot x^j \pmod{2} \\
&= \pi(x) + 1 + \sum_{i=2}^{d} q_i \cdot \sum_{j=1}^{i-1} C_i^j \cdot x^j \pmod{2}. \qquad (24)
\end{aligned}$$

The last equality in Eq. (24) is true because π is a PP on $W_2$, and thus $\sum_{i=1}^{d} q_i = 1 \pmod{2}$ from Lemma 3.5. For $x = 0 \pmod{2}$, Eq. (24) becomes

$$\pi(x + k) = \pi(x) + 1 \pmod{2}, \qquad (25)$$

i.e. Eq. (17) for $k = 1 \pmod{2}$ and n = 1. For $x = 1 \pmod{2}$, Eq. (24) becomes

$$\begin{aligned}
\pi(x + k) &= \pi(x) + 1 + \sum_{i=2}^{d} q_i \cdot \sum_{j=1}^{i-1} C_i^j \pmod{2} \\
&= \pi(x) + 1 + \sum_{i=2}^{d} q_i \cdot \Big( \sum_{j=0}^{i} C_i^j - 1 - 1 \Big) \pmod{2} \\
&= \pi(x) + 1 + \sum_{i=2}^{d} q_i \cdot \big( (1 + 1)^i - 2 \big) \pmod{2} \\
&= \pi(x) + 1 + \sum_{i=2}^{d} q_i \cdot (2^i - 2) \pmod{2} = \pi(x) + 1 \pmod{2}, \qquad (26)
\end{aligned}$$

i.e. Eq. (17) for $k = 1 \pmod{2}$ and n = 1. Thus, the proof is completed.
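Eq. (17) can also be checked exhaustively for a small example. This is a sketch under stated assumptions: the function names are illustrative, and the quintic polynomial modulo 64 is a hypothetical PP chosen to satisfy Theorem 3.4. The polynomial 2x modulo 8 is included as a non-PP for contrast.

```python
def poly_mod(coeffs, x, N):
    """Evaluate q1*x + q2*x^2 + ... + qd*x^d modulo N; coeffs = [q1, ..., qd]."""
    return sum(q * pow(x, i, N) for i, q in enumerate(coeffs, start=1)) % N


def lemma_3_9_holds(coeffs, n, M):
    """Check pi(x + k*2^(n-1)) = pi(x) + (k mod 2)*2^(n-1) (mod 2^n) for all x, k in W_N."""
    N = (2 ** n) * M
    half = 2 ** (n - 1)
    modulus = 2 ** n
    return all(
        poly_mod(coeffs, x + k * half, N) % modulus
        == (poly_mod(coeffs, x, N) + (k % 2) * half) % modulus
        for x in range(N)
        for k in range(N)
    )


# Hypothetical quintic PP modulo 64 (satisfies the conditions of Theorem 3.4).
EXAMPLE_COEFFS = [1, 2, 2, 4, 2]
```

In this sketch the interleaver length is kept at N = 64 and n runs from 1 to 6, so that M = 64 / 2^n; the identity fails for 2x modulo 8, which is not a PP.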

The usefulness of Lemma 3.9 is shown in the following. Let $x \in W_{2^n}$. Let $x = (x_{n-1} x_{n-2} \ldots x_1 x_0)_2$ and $\pi(x) \pmod{2^n} = (\pi_{x,n-1} \pi_{x,n-2} \ldots \pi_{x,1} \pi_{x,0})_2$ be the base 2 expressions of x and $\pi(x) \pmod{2^n}$, respectively. The bit with index 0 is the least significant bit and the bit with index n − 1 is the most significant bit. From Lawrie [33] we know that to get from an input pin x of a $2^n \times 2^n$ butterfly network to the output pin $\pi(x) \pmod{2^n}$, the control bits are obtained by the equation

$$cb_{x,j} = x_j \oplus \pi_{x,j}, \quad \forall j = 0, 1, \ldots, n-1, \ \forall x \in W_{2^n}, \qquad (27)$$

where the control bits $cb_{x,j}$ for j = 0, 1, ..., n − 1 are taken from the left to the right of the butterfly network. The decimal value of the control bits $(cb_{x,n-1} cb_{x,n-2} \ldots cb_{x,1} cb_{x,0})$ for the input pin x and the output pin $\pi(x) \pmod{2^n}$ is denoted by $cb_x$. From Eq. (17), we have

$$\begin{aligned}
cb_{x+2^{n-1}} &= \big( \pi(x + 2^{n-1}) \pmod{2^n} \big)_2 \oplus \big( (x + 2^{n-1}) \pmod{2^n} \big)_2 \\
&= \big( (\pi(x) + 2^{n-1}) \pmod{2^n} \big)_2 \oplus \big( (x + 2^{n-1}) \pmod{2^n} \big)_2 \\
&= \Big( (\pi_{x,n-1} \pi_{x,n-2} \ldots \pi_{x,1} \pi_{x,0})_2 + (1 \underbrace{0 \ldots 0\, 0}_{n-1 \text{ bits } 0})_2 \Big) \pmod{2^n} \\
&\quad \oplus \Big( (x_{n-1} x_{n-2} \ldots x_1 x_0)_2 + (1 \underbrace{0 \ldots 0\, 0}_{n-1 \text{ bits } 0})_2 \Big) \pmod{2^n} \\
&= \big( (\pi_{x,n-1} \oplus 1) \pi_{x,n-2} \ldots \pi_{x,1} \pi_{x,0} \big)_2 \oplus \big( (x_{n-1} \oplus 1) x_{n-2} \ldots x_1 x_0 \big)_2 \\
&= (\pi_{x,n-1} \pi_{x,n-2} \ldots \pi_{x,1} \pi_{x,0})_2 \oplus (x_{n-1} x_{n-2} \ldots x_1 x_0)_2. \qquad (28)
\end{aligned}$$

Using Eq. (28) for n = 1, 2, ... we have

$$\begin{aligned}
& cb_{0,0} = cb_{1,0} = \cdots = cb_{2^n-1,0} = \pi_{0,0}, \\
& cb_{0,1} = cb_{2,1} = cb_{4,1} = \cdots = cb_{2^n-2,1} = \pi_{0,1}, \\
& cb_{1,1} = cb_{3,1} = cb_{5,1} = \cdots = cb_{2^n-1,1} = \pi_{1,1}, \\
& cb_{0,2} = cb_{4,2} = cb_{8,2} = \cdots = cb_{2^n-4,2} = \pi_{0,2}, \\
& cb_{1,2} = cb_{5,2} = cb_{9,2} = \cdots = cb_{2^n-3,2} = \pi_{1,2}, \\
& cb_{2,2} = cb_{6,2} = cb_{10,2} = \cdots = cb_{2^n-2,2} = \pi_{2,2}, \\
& cb_{3,2} = cb_{7,2} = cb_{11,2} = \cdots = cb_{2^n-1,2} = \pi_{3,2}, \\
& \qquad \cdots \\
& cb_{0,n-1} = cb_{2^{n-1},n-1} = \pi_{0,n-1}, \\
& cb_{1,n-1} = cb_{2^{n-1}+1,n-1} = \pi_{1,n-1}, \\
& \qquad \cdots \\
& cb_{2^{n-1}-1,n-1} = cb_{2^n-1,n-1} = \pi_{2^{n-1}-1,n-1}. \qquad (29)
\end{aligned}$$
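Eqs. (27)–(29) can be exercised on a small case. The sketch below is an illustration, not the authors' implementation: it computes all control bits from Eq. (27), then keeps only one bit per pair (x mod 2^j, j), which is the periodicity expressed by Eq. (29). The function names and the example polynomial are assumptions.

```python
def poly_mod(coeffs, x, N):
    """Evaluate q1*x + q2*x^2 + ... + qd*x^d modulo N; coeffs = [q1, ..., qd]."""
    return sum(q * pow(x, i, N) for i, q in enumerate(coeffs, start=1)) % N


def control_bits(coeffs, n):
    """cb[x][j] = x_j XOR pi(x)_j for every input pin x of a 2^n x 2^n network (Eq. (27))."""
    size = 2 ** n
    return [
        [((x ^ poly_mod(coeffs, x, size)) >> j) & 1 for j in range(n)]
        for x in range(size)
    ]


def compressed_table(coeffs, n):
    """Keep cb_{r,j} only for r < 2^j, i.e. 1 + 2 + ... + 2^(n-1) entries."""
    cb = control_bits(coeffs, n)
    return {(r, j): cb[r][j] for j in range(n) for r in range(2 ** j)}


def lookup(table, x, j):
    """Recover cb_{x,j} from the compressed table using x mod 2^j."""
    return table[(x % (2 ** j), j)]


# Hypothetical quintic PP modulo 2^n (satisfies the conditions of Theorem 3.4).
EXAMPLE_COEFFS = [1, 2, 2, 4, 2]
```

For n = 4 the compressed table has 15 entries, and `lookup` reproduces all 64 control bits of the full table.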

From Eq. (29) it results that of all $n \cdot 2^n$ control bits only $2^n - 1$ values need to be stored.

4. Parallel access by butterfly networks for ARP interleavers

4.1. Generating ARP interleavers

The definition of an ARP interleaver is given below.

Definition 4.1. An ARP interleaver modulo N is defined as

$$\pi(x) = \begin{cases}
P x + P_0 \pmod{N}, & \text{if } x = 0 \pmod{R}, \\
P x + P_1 \pmod{N}, & \text{if } x = 1 \pmod{R}, \\
\qquad \cdots & \qquad \cdots \\
P x + P_{R-1} \pmod{N}, & \text{if } x = R - 1 \pmod{R},
\end{cases} \qquad (30)$$

where $R \in W_N$, so that $R \mid N$. $\pi(x)$ from Eq. (30) is an interleaver modulo N only if $\gcd(P, N) = 1$. In the following we obtain the conditions for the free terms $P_0, P_1, \ldots, P_{R-1}$ so that $\pi(x)$ from Eq. (30) is an interleaver.
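Definition 4.1 translates directly into a few lines of code. The sketch below is illustrative: the function names and the parameter values (N = 16, R = 4, P = 3 and the two offset vectors) are assumptions chosen for the example, not values from the paper.

```python
from math import gcd


def arp(P, offsets, N):
    """Return [pi(0), ..., pi(N-1)] with pi(x) = P*x + P_{x mod R} (mod N), R = len(offsets)."""
    R = len(offsets)
    if N % R != 0:
        raise ValueError("R must divide N")
    if gcd(P, N) != 1:
        raise ValueError("gcd(P, N) must be 1 for pi to be an interleaver")
    return [(P * x + offsets[x % R]) % N for x in range(N)]


def is_permutation(values):
    """True if values is a rearrangement of 0, 1, ..., len(values) - 1."""
    return sorted(values) == list(range(len(values)))
```

With these assumed parameters, `arp(3, [0, 4, 8, 12], 16)` is a permutation of $W_{16}$, while `arp(3, [0, 5, 2, 7], 16)` is not, since it maps both 1 and 2 to 8. This shows why the free terms need additional conditions beyond gcd(P, N) = 1.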


We note that Definition 4.1 of an ARP is not exactly the one given in [22], but it was used in [32]. In [22] the parameters $P_0, P_1, \ldots, P_{R-1}$ are expressed in terms of two vectors, each of them with R components, and of the parameter P. Definition 4.1 is closer to the definition of a parallel linear permutation polynomial (PLPP) from [15,21,38]. Actually, an ARP as in Eq. (30) is a PLPP with the coefficients of the linear terms equal to each other. In the following we refer to an interleaver defined by Eq. (30) as an ARP interleaver.

We begin by proving a lemma with a property of linear permutation polynomials (LPPs). An LPP is a first degree PP.

Lemma 4.2. Let there be two LPPs $\pi_a(x) = P x + P_a \pmod{N}$ and $\pi_b(x) = P x + P_b \pmod{N}$ on $W_N$. Then we can obtain the permutation elements given by LPP $\pi_b(x)$ by a cyclic shift to the left of the permutation elements given by LPP $\pi_a(x)$ by a number $k \in W_N$ of positions, where k is the unique modulo N solution of the congruence equation

$$P \cdot k = P_b - P_a \pmod{N}. \qquad (31)$$

Proof. The permutation elements of LPP $\pi_a(x)$ cyclically shifted to the left by k positions lead to the permutation

$$P \cdot (x + k) + P_a = P \cdot x + (P \cdot k + P_a) \pmod{N}. \qquad (32)$$

If we impose that the permutation generated with Eq. (32) is the same as that generated by LPP $\pi_b(x)$, we obtain the congruence Eq. (31). Because $\gcd(P, N) = 1$, Eq. (31) has only one solution modulo N in the variable k for given $P_a$ and $P_b$ (Theorem 57 from [36]).

We note that if $R \in W_N$ so that $R \mid N$, then the equation

$$P \cdot k = P_b - P_a \pmod{R} \qquad (33)$$

has only one solution modulo R since $\gcd(P, R) = 1$, and this solution is equal to the solution of Eq. (31) evaluated modulo R.

The next proposition gives some results from number theory.

Proposition 4.3. Let $N = L \cdot M$ with L and M positive integers. Let $x, y \in \mathbb{N}$. We have that

1) If $x = y \pmod{N}$ then $x = y \pmod{L}$.
2) If $x = y \pmod{L}$ then not necessarily $x = y \pmod{N}$.
3) If $x \ne y \pmod{L}$ then $x \ne y \pmod{N}$.
4) If $x \ne y \pmod{N}$ then not necessarily $x \ne y \pmod{L}$.

Proof. 1) Let $x_N = x \pmod N$ and $y_N = y \pmod N$. We can write $x = x_N + k_x \cdot N$ and $y = y_N + k_y \cdot N$ with $k_x, k_y \in \mathbb{N}$. From the hypothesis we have $x_N = y_N$. Then $x \pmod L = (x_N + k_x \cdot N) \pmod L = (x_N + k_x \cdot L \cdot M) \pmod L = x_N \pmod L$. Similarly, $y \pmod L = y_N \pmod L$. Since $x_N = y_N$, it follows that $x = y \pmod L$.

2) A counterexample for this case is x = 0 and y = L.

3) We can write $x = x_N + k_{x,N} \cdot N = x_L + k_{x,L} \cdot L + k_{x,N} \cdot N$ and $y = y_N + k_{y,N} \cdot N = y_L + k_{y,L} \cdot L + k_{y,N} \cdot N$ with $k_{x,N}, k_{x,L}, k_{y,N}, k_{y,L} \in \mathbb{N}$, where $x_N = x \pmod N$, $x_L = x \pmod L = x_N \pmod L$, $y_N = y \pmod N$ and $y_L = y \pmod L = y_N \pmod L$. Since $x_N < N$ and $y_N < N$, we have $k_{x,L} \le M - 1$ and $k_{y,L} \le M - 1$. From the hypothesis we have $x_L \neq y_L$, and we assume that $x_L < y_L$. Then $(y - x) \pmod N = (y_L - x_L + L \cdot (k_{y,L} - k_{x,L})) \pmod N = (y_L - x_L + L \cdot k_{dif,M}) \pmod N$, where $k_{dif,M} = (k_{y,L} - k_{x,L}) \pmod M$. Since $0 \le k_{dif,M} \le M - 1$ and $1 \le y_L - x_L \le L - 1$, we have $1 \le y_L - x_L + L \cdot k_{dif,M} \le N - 1$. This means that $(y - x) \neq 0 \pmod N$.

4) A counterexample for this case is again x = 0 and y = L.

The parameters of an ARP as in Eq. (30) have the following property.

Lemma 4.4. Consider an ARP as in Eq. (30). Then the parameters $P_0, P_1, \ldots, P_{R-1}$ of the ARP fulfill the conditions

$$P \cdot (j - i) \neq P_i - P_j \pmod R, \quad \forall i, j \in W_R \text{ with } i < j. \qquad (34)$$

Proof. Consider two arbitrary LPPs in the definition of an ARP, $\pi_i(x) = P \cdot x + P_i \pmod N$ and $\pi_j(x) = P \cdot x + P_j \pmod N$, where $i, j \in W_R$ with $i < j$. According to Lemma 4.2, the permutation elements given by the LPP $\pi_i(x)$ are obtained by a cyclic shift to the left of the permutation elements given by the LPP $\pi_j(x)$ by k positions, where k is the solution of the congruence in Eq. (31) with $P_a = P_j$ and $P_b = P_i$. The values from each LPP are taken with a step equal to R. If $k \pmod R = j - i$, then there exist equal elements in the final ARP permutation that includes the two LPPs, and thus it is not an interleaver. To avoid obtaining the same values for the two different LPPs at different values of x modulo R, it is required that

$$P \cdot (j - i) \neq P_i - P_j \pmod R. \qquad (35)$$

Since i and j are arbitrary in the set $W_R$, the result in Lemma 4.4 follows.
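The condition of Lemma 4.4 is easy to test for a given parameter set. The sketch below assumes that Eq. (30) has the form π(x) = P·x + P_{x mod R} (mod N), which is consistent with the ARP in Eq. (58); the function names are hypothetical, and the second parameter set is a deliberately invalid example introduced here.

```python
from math import gcd

def arp(P, offsets, N):
    """Assumed form of Eq. (30): pi(x) = P*x + P_{x mod R} (mod N)."""
    R = len(offsets)
    return [(P * x + offsets[x % R]) % N for x in range(N)]

def satisfies_eq34(P, offsets):
    """Check P*(j - i) != P_i - P_j (mod R) for all i < j in W_R."""
    R = len(offsets)
    return all((P * (j - i) - (offsets[i] - offsets[j])) % R != 0
               for i in range(R) for j in range(i + 1, R))

def is_permutation(values):
    return sorted(values) == list(range(len(values)))

N, P = 112, 15
good = [0, 54]   # parameters of the ARP in Eq. (58)
bad = [0, 55]    # illustrative set that violates Eq. (34) (assumption)
assert gcd(P, N) == 1 and N % 2 == 0

print(satisfies_eq34(P, good), is_permutation(arp(P, good, N)))
print(satisfies_eq34(P, bad), is_permutation(arp(P, bad, N)))
```

The first parameter set satisfies Eq. (34) and yields a permutation of W_112; the second violates Eq. (34) and produces repeated values.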

We remark that if all parameters $P_0, P_1, \ldots, P_{R-1}$ are of the form $k \cdot R + k_0$, with $k, k_0 \in \mathbb{N}$, i.e. multiples of R plus a constant value, we have $P_i - P_j = 0 \pmod R$, $\forall i, j \in W_R$ with $i < j$. Because P and R are relatively prime and $j - i \neq 0 \pmod R$, it follows that $P \cdot (j - i) \neq 0 \pmod R$. Thus, the condition in Eq. (34) is always fulfilled when the parameters $P_0, P_1, \ldots, P_{R-1}$ are multiples of R plus a constant value. In [22] it is specified that if the free terms of the component LPPs of an ARP are multiples of R, then Eq. (30) provides a valid interleaver. This is the particular case $k_0 = 0$ of the situation above.

Next we give a lemma that provides a way to construct ARPs on $W_N$ with $R = k_R \cdot R'$ component LPPs from an ARP on $W_N$ with $R'$ LPPs.

Lemma 4.5. Let N be the interleaver length. Let $P, P_0, P_1, \ldots, P_{R'-1}$ be the parameters of an ARP on $W_N$ with $R'$ component LPPs, such that $P_i = k_{P_i} \cdot R' + k_0$, with $k_{P_i} \in \mathbb{N}$, $k_0 \in W_{R'}$, $\forall i \in W_{R'}$. Then the parameters $P, P_0, P_1, \ldots, P_{R-1}$ fulfill the conditions for an ARP on $W_N$ with $R = k_R \cdot R'$ component LPPs, where $k_R \in \mathbb{N}^*$ and $R \mid N$, if $P_{i_{R'} + k_i \cdot R'} = P_{i_{R'}} \pmod R$, $\forall i_{R'} \in W_{R'}$ and $\forall k_i \in W_{k_R}$.

Proof. Because P is relatively prime to N and $R \mid N$, P is relatively prime to R. We have to prove that the parameters $P, P_0, P_1, \ldots, P_{R-1}$ fulfill the conditions in Eq. (34) of Lemma 4.4. Let $i, j \in W_R$ with $i < j$. We write the indices i and j as $i = k_i \cdot R' + i_{R'}$ and $j = k_j \cdot R' + j_{R'}$, respectively, with $k_i, k_j \in W_{R/R'}$ and $i_{R'}, j_{R'} \in W_{R'}$. Then we have

$$P_i - P_j \pmod R = P_{i_{R'}} - P_{j_{R'}} \pmod R \qquad (36)$$


and

$$P \cdot (j - i) = P \cdot (j_{R'} - i_{R'}) + P \cdot R' \cdot (k_j - k_i) \pmod R. \qquad (37)$$

First we assume that $i_{R'} = j_{R'}$. Then, from Eq. (36) we have $P_i - P_j = 0 \pmod R$ and from Eq. (37) we have $P \cdot (j - i) = P \cdot R' \cdot (k_j - k_i) \pmod R$. Because $i < j$ we have $k_i < k_j$ and $k_j - k_i \le R/R' - 1$. Then, because P is relatively prime to R, we have $P \cdot (j - i) \neq 0 \pmod R$, i.e. condition (34) for $P_i - P_j = 0 \pmod R$.

Now we assume that $i_{R'} \neq j_{R'}$. Because the parameters $P, P_0, P_1, \ldots, P_{R'-1}$ are valid for an ARP on $W_N$ with $R'$ component LPPs, we have

$$P \cdot (j_{R'} - i_{R'}) \neq P_{i_{R'}} - P_{j_{R'}} \pmod{R'}, \quad \forall i_{R'}, j_{R'} \in W_{R'} \text{ with } i_{R'} \neq j_{R'}. \qquad (38)$$

Because all parameters $P_0, P_1, \ldots, P_{R'-1}$ are multiples of $R'$ plus a constant, we have

$$P_{i_{R'}} - P_{j_{R'}} = 0 \pmod{R'}, \text{ i.e. } P_{i_{R'}} - P_{j_{R'}} = k_{P_{ij}} \cdot R' \pmod R \text{ with } k_{P_{ij}} \in W_{R/R'}. \qquad (39)$$

Taking into account Eq. (39), from Eq. (38) it follows that

$$P \cdot (j_{R'} - i_{R'}) = k_P \cdot R' + k_{0,P} \pmod R, \text{ with } k_P \in W_{R/R'},\ k_{0,P} \in W_{R'},\ k_{0,P} \neq 0. \qquad (40)$$

Using Eq. (40) in Eq. (37), we have

$$P \cdot (j - i) = R' \cdot \big(P \cdot (k_j - k_i) + k_P\big) + k_{0,P} \pmod R. \qquad (41)$$

Taking into account that $k_{0,P} \neq 0$, condition (34) follows from Eqs. (41) and (39). This completes the proof.

We note that an example of an ARP constructed according to Lemma 4.5 is given in [22] for N = 5472, $R' = 4$ and $R = 3 \cdot 4 = 12$. For this ARP, P = 97, $P_0 = 0$, $P_1 = 24$, $P_2 = 404$, $P_3 = 1588$, $P_4 = 1176$, $P_5 = 1200$, $P_6 = 404$, $P_7 = 412$, $P_8 = 0$, $P_9 = 1200$, $P_{10} = 1580$, and $P_{11} = 1588$. We observe that $P_0 = P_1 = P_2 = P_3 = 0 \pmod 4$, $P_0 = P_4 = P_8 = 0 \pmod{12}$, $P_1 = P_5 = P_9 = 0 \pmod{12}$, $P_2 = P_6 = P_{10} = 8 \pmod{12}$, and $P_3 = P_7 = P_{11} = 4 \pmod{12}$.

4.2. Proof for Parallel access by butterfly networks for ARP interleavers

In this section we prove two lemmas similar to Lemmas 3.7 and 3.9, but for an ARP interleaver as in Eq. (30).

Lemma 4.6. Let p be a prime number and let $n_{max}$ be the greatest positive integer such that $p^{n_{max}} \mid N$. Assume that π is an ARP on $W_N$ as in Eq. (30) with $R = p^n$, where $n \in \mathbb{N}$ and $n \le n_{max}$. Let m be a positive integer such that $n \le m \le n_{max}$, and let $M = N/p^m$. Let x and y be in $W_M$. Then

$$x = y \pmod{p^m} \qquad (42)$$

if and only if

$$\pi(x) = \pi(y) \pmod{p^m}. \qquad (43)$$

Proof. Consider $r \in \mathbb{N}$ with $r \le n_{max} - n$. According to Lemma 4.4 and Proposition 4.3, case 3, we have

$$P \cdot (j - i) \neq P_i - P_j \pmod{p^r \cdot R}, \quad \forall i, j \in W_R \text{ with } i < j. \qquad (44)$$

“⇒” Let $x, y \in W_M$ fulfill Eq. (42). Let $x_R = x \pmod{p^n}$ and $y_R = y \pmod{p^n}$.

If $x_R = y_R$, we have $\pi(x) \pmod{p^m} = P \cdot x + P_{x_R} \pmod{p^m}$ and $\pi(y) \pmod{p^m} = P \cdot y + P_{x_R} \pmod{p^m}$. Since $P \cdot x + P_{x_R}$ is an LPP modulo N, it is also an LPP modulo $p^m$. Then, since $x = y \pmod{p^m}$, it follows that $\pi(x) = \pi(y) \pmod{p^m}$.

If $x_R \neq y_R$, we assume that $x_R < y_R$. Then we have $x = x_R + k_x \cdot p^n$ and $y = y_R + k_y \cdot p^n$ with $k_x, k_y \in \mathbb{N}$. Thus

$$\pi(y) - \pi(x) \pmod{p^m} = \pi(y_R + k_y \cdot p^n) - \pi(x_R + k_x \cdot p^n) \pmod{p^m} = P \cdot (y_R - x_R) + P_{y_R} - P_{x_R} + P \cdot p^n \cdot (k_y - k_x) \pmod{p^m}. \qquad (45)$$

From Eq. (44), with r = 0, $j = y_R$, and $i = x_R$, we have $P \cdot (y_R - x_R) + P_{y_R} - P_{x_R} \neq 0 \pmod{p^n}$. Then $P \cdot (y_R - x_R) + P_{y_R} - P_{x_R} = z_R + k_z \cdot p^n$ with $z_R \in W_R$, $z_R > 0$, and $k_z \in \mathbb{N}$. Thus Eq. (45) becomes

$$\pi(y) - \pi(x) \pmod{p^m} = z_R + p^n \cdot \big(P \cdot (k_y - k_x) + k_z\big) \pmod{p^m} = z_R + p^n \cdot k_t \pmod{p^m}, \qquad (46)$$

where $k_t = P \cdot (k_y - k_x) + k_z \pmod{p^m}$. But the quantity $z_R + p^n \cdot k_t \pmod{p^m}$ can be equal to 0 only if $z_R = 0 \pmod{p^n}$. Since $z_R \neq 0 \pmod{p^n}$, we have $\pi(y) - \pi(x) \neq 0 \pmod{p^m}$.

“⇐” Let $x, y \in W_M$ fulfill Eq. (43). Let $x_R = x \pmod{p^n}$ and $y_R = y \pmod{p^n}$.

If $x_R = y_R$, we have again $\pi(x) \pmod{p^m} = P \cdot x + P_{x_R} \pmod{p^m}$ and $\pi(y) \pmod{p^m} = P \cdot y + P_{x_R} \pmod{p^m}$. Since $P \cdot x + P_{x_R}$ is an LPP modulo $p^m$ and $\pi(x) = \pi(y) \pmod{p^m}$, it follows that $x = y \pmod{p^m}$.

If $x_R \neq y_R$, we have $x = x_R + k_x \cdot p^n$ and $y = y_R + k_y \cdot p^n$ with $k_x, k_y \in \mathbb{N}$. The quantity $y - x \pmod{p^m} = y_R - x_R + p^n \cdot (k_y - k_x) \pmod{p^m}$ can be equal to 0 only if $y_R - x_R = 0 \pmod{p^n}$. Since $x_R \neq y_R \pmod{p^n}$, it follows that $x \neq y \pmod{p^m}$. This completes the proof.

We note that if $m < n$ in Lemma 4.6, then, according to Lemma 4.4 and Proposition 4.3, case 4, we can have

$$P \cdot (j - i) = P_i - P_j \pmod{p^m}, \text{ for some } i, j \in W_R \text{ with } i < j. \qquad (47)$$

Eq. (47) implies that $x \neq y \pmod{p^m}$ can result in $\pi(x) = \pi(y) \pmod{p^m}$. If $m < n$ and $x_R \neq y_R$, we can have $y_R - x_R = 0 \pmod{p^m}$. Thus, $\pi(x) \neq \pi(y) \pmod{p^m}$ can result in $x = y \pmod{p^m}$.

From Lemma 4.6 it follows that ARP interleavers can be used with butterfly networks in parallel turbo decoding. They provide the same easy way to compute the control bits as any PP interleaver if the number of component LPPs is $R = 2^n$ and the number of processors used (or the number of memory banks used for storing extrinsic values) is greater than or equal to R. However, if the ARP parameters fulfill the conditions

$$P \cdot (j - i) \neq P_i - P_j \pmod{2^m}, \quad \forall i, j \in W_R \text{ with } i < j \text{ and } i \neq j \pmod{2^m}, \qquad (48)$$

for a positive integer $m < n$, then the number of processors used can also be equal to $2^m$.

For example, consider the ARP interleaver

$$\pi(x) = \begin{cases} 3x \pmod{112}, & \text{if } x = 0 \pmod 4, \\ 3x \pmod{112}, & \text{if } x = 1 \pmod 4, \\ 3x + 3 \pmod{112}, & \text{if } x = 2 \pmod 4, \\ 3x + 1 \pmod{112}, & \text{if } x = 3 \pmod 4. \end{cases} \qquad (49)$$


This ARP has parameters P = 3, $P_0 = P_1 = 0$, $P_2 = 3$ and $P_3 = 1$. It can be checked that these parameters verify the conditions in Lemma 4.4 for R = 4. However, they do not verify conditions (48) for m = 1, i = 0, and j = 3, since $3 \neq 0 \pmod 2$ and $P \cdot (3 - 0) \pmod 2 = P_0 - P_3 \pmod 2 = 1$. Thus, the number of processors used cannot be equal to 2. In contrast, the ARP interleaver

$$\pi(x) = \begin{cases} 3x \pmod{112}, & \text{if } x = 0 \pmod 4, \\ 3x + 2 \pmod{112}, & \text{if } x = 1 \pmod 4, \\ 3x \pmod{112}, & \text{if } x = 2 \pmod 4, \\ 3x + 6 \pmod{112}, & \text{if } x = 3 \pmod 4, \end{cases} \qquad (50)$$

has parameters P = 3, $P_0 = P_2 = 0$, $P_1 = 2$ and $P_3 = 6$, which verify both the conditions in Lemma 4.4 for R = 4 and conditions (48) for m = 1. Thus, for this ARP interleaver the number of processors used can also be equal to 2.

We note that if all parameters $P_0, P_1, \ldots, P_{R-1}$ are of the form $k \cdot 2^n + k_0$ with $k, k_0 \in \mathbb{N}$, i.e. multiples of $R = 2^n$ plus a constant value, then for $m < n$ we have $P_i - P_j = 0 \pmod{2^m}$, $\forall i, j \in W_R$. Because P and $2^m$ are relatively prime, for $i \neq j \pmod{2^m}$ it follows that $P \cdot (j - i) \neq 0 \pmod{2^m}$. Thus, condition (48) is fulfilled $\forall m < n$.

In [25] ARP interleavers for 172 lengths were proposed for the LTE standard [41] as an alternative to QPP interleavers. These 172 ARP interleavers have R = 4 or R = 8 component LPPs and all parameters $P_0, P_1, \ldots, P_{R-1}$ are multiples of R. It follows that for these ARP interleavers the number of processors used can be any number $2^m$ with $m \in \mathbb{N}^*$ such that $2^m \mid N$. The previous observations for R = 4 also apply to 5 of the 12 ARP interleavers used in the DVB-RCS standard [23] and to 13 of the 17 ARP interleavers used in the WiMax standard [24], because their parameters $P_0$, $P_1$, $P_2$, and $P_3$ are multiples of 4 plus 1.

We also note that if the parameters of an ARP interleaver satisfy the conditions of Lemma 4.5 for $R' = 2^m$ and $k_R = 2^{n-m}$, $m < n$, then the number of processors used can also be $2^m$.
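The two example ARPs of Eqs. (49) and (50) can be checked against the conditions of Lemma 4.4 and conditions (48) with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the ARP is assumed to have the form π(x) = P·x + P_{x mod R} (mod N).

```python
def satisfies_eq34(P, offsets):
    """P*(j - i) != P_i - P_j (mod R) for all i < j in W_R."""
    R = len(offsets)
    return all((P * (j - i) - (offsets[i] - offsets[j])) % R != 0
               for i in range(R) for j in range(i + 1, R))

def satisfies_eq48(P, offsets, m):
    """P*(j - i) != P_i - P_j (mod 2^m) for i < j with i != j (mod 2^m)."""
    R, mod = len(offsets), 2 ** m
    return all((P * (j - i) - (offsets[i] - offsets[j])) % mod != 0
               for i in range(R) for j in range(i + 1, R)
               if (j - i) % mod != 0)

def congruence_preserved(P, offsets, N, m):
    """True if x = y (mod 2^m) exactly when pi(x) = pi(y) (mod 2^m)."""
    R, mod = len(offsets), 2 ** m
    pi = [(P * x + offsets[x % R]) % N for x in range(N)]
    return all(((x - y) % mod == 0) == ((pi[x] - pi[y]) % mod == 0)
               for x in range(N) for y in range(N))

N, P = 112, 3
arp49 = [0, 0, 3, 1]   # P0..P3 of Eq. (49)
arp50 = [0, 2, 0, 6]   # P0..P3 of Eq. (50)

for name, offs in (("Eq. (49)", arp49), ("Eq. (50)", arp50)):
    print(name, satisfies_eq34(P, offs), satisfies_eq48(P, offs, 1),
          congruence_preserved(P, offs, N, 1))
```

Both parameter sets satisfy Eq. (34), but only the set of Eq. (50) satisfies conditions (48) for m = 1. Only for that set do the memory indices modulo 2 follow the addresses modulo 2.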
Indeed, when the conditions of Lemma 4.5 hold with $R' = 2^m$ and $k_R = 2^{n-m}$, and with the same notations as in the proof of Lemma 4.5, we have $P_i - P_j = P_{i_{R'}} - P_{j_{R'}} = 0 \pmod{2^m}$, $\forall i, j \in W_R$ with $i \neq j \pmod{2^m}$ or $i_{R'} \neq j_{R'}$. The condition $P \cdot (j - i) \neq 0 \pmod{2^m}$ is fulfilled for the same reasons as above. Thus, for this value of m, condition (48) is fulfilled. 7 of the 12 ARP interleavers used in the DVB-RCS standard [23] and 4 of the 17 ARP interleavers used in the WiMax standard [24] fit this case with m = 1, because their parameters fulfill the conditions $P_0 = P_1 = 1 \pmod 2$, $P_0 = P_2 = 1 \pmod 4$, and $P_1 = P_3 = 3 \pmod 4$.

Lemma 4.7. Let N be an even number and let $n_{max}$ be the greatest positive integer such that $2^{n_{max}} \mid N$. Assume that π is an ARP on $W_N$ as in Eq. (30) with $R = 2^n$, where $n \in \mathbb{N}$ and $n \le n_{max}$. Let m be a positive integer such that $n \le m \le n_{max}$. Then the equation

$$\pi(x + k \cdot 2^{m-1}) = \pi(x) + (k \bmod 2) \cdot 2^{m-1} \pmod{2^m} \qquad (51)$$

holds $\forall x \in W_N$ and $\forall k \in W_N$
1) for every set of ARP parameters if $m > n$;
2) for ARP parameters fulfilling the conditions

$$P_r = P_{r+R/2} \pmod R, \quad \forall r \in W_{R/2}, \qquad (52)$$

if $m = n$.

Proof. If $m \ge n$ and k is an even number, Eq. (51) is obvious, since $x + k \cdot 2^{m-1} \pmod{2^n} = x \pmod{2^n}$ and $k \cdot 2^{m-1} \pmod{2^m} = (k \bmod 2) \cdot 2^{m-1} \pmod{2^m} = 0$.


Table 1. The hardware implementation complexity of the memory reading circuit for different interconnection networks when using PP of degree d interleavers.

| Network structure | Number of multiplexers for LLR routing | Number of multiplexers for LLR reading | Number of full adders |
|---|---|---|---|
| Crossbar network | $N_{ext} \cdot (N_{ext} - 1)$ | $d \cdot N_{ext}$ | $\frac{1}{2} N_{ext} \cdot (N_{ext} + 4d - 1)$ |
| Master-slave Batcher network | $\frac{N_{ext}}{2} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 4) - 2$ | $\frac{N_{ext}}{2} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 2d + 4) - 2$ | $\frac{N_{ext}}{4} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 8d + 4) - 1$ |
| Benes network | $N_{ext} \cdot (2 \log_2 N_{ext} - 1)$ | $d \cdot N_{ext} + 2$ | $2d \cdot N_{ext} + 4$ |
| Barrel shifter network | $N_{ext} \cdot \log_2 N_{ext}$ | $d \cdot N_{ext}$ | $\frac{N_{ext}}{2} \cdot (4d + 1)$ |
| Butterfly network | $N_{ext} \cdot \log_2 N_{ext}$ | $d \cdot N_{ext}$ | $2d \cdot N_{ext}$ |

We now prove Eq. (51) for k an odd number, i.e. $k = 2 \cdot l + 1$ with $l \in \mathbb{N}$. We note that, since N is an even number and P is relatively prime to N, P is an odd number, i.e. $P = 2 \cdot Q + 1$ with $Q \in \mathbb{N}$. We denote by $x_R$ the value of x modulo R.

First we assume that $m > n$. We have $x + k \cdot 2^{m-1} \pmod{2^n} = x + (2l + 1) \cdot 2^{m-1} \pmod{2^n} = x \pmod{2^n} = x_R$. Then

$$\pi(x + k \cdot 2^{m-1}) \pmod{2^m} = P \cdot (x + k \cdot 2^{m-1}) + P_{x_R} \pmod{2^m} = P \cdot x + P_{x_R} + P \cdot k \cdot 2^{m-1} \pmod{2^m} = \pi(x) + (2Q + 1) \cdot (2l + 1) \cdot 2^{m-1} \pmod{2^m} = \pi(x) + 2^{m-1} \pmod{2^m}, \qquad (53)$$

i.e. Eq. (51) for k an odd number.

Now we assume that $m = n$. We have $x + k \cdot 2^{m-1} \pmod{2^n} = x + (2l + 1) \cdot 2^{m-1} \pmod{2^n} = x_R + 2^{n-1} \pmod{2^n} = x_R + R/2 \pmod R$. Taking into account conditions (52), it follows that $P_{(x_R + R/2) \bmod R} = P_{x_R} \pmod R$. Then Eq. (51) follows in the same way as in Eq. (53).

5. Implementation complexity analysis and examples

5.1. Implementation complexity analysis

An ASIC implementation of a parallel turbo decoder using butterfly networks is beyond the scope of the present paper. However, in this section we give a theoretical analysis of the implementation complexity of the memory reading circuit when using butterfly networks for PP interleavers of degree d or for ARP interleavers with R LPPs. The implementation complexity is expressed in terms of the number of 2-input multiplexers and the number of full adders. In this analysis we compare the butterfly network with four of the best known and most used interconnection networks: the crossbar network [5], the master-slave Batcher network [6], the Benes network [7], and the barrel shifter network [8].

The interconnection network is assumed to route $N_{ext} = 2^n$, $n \in \mathbb{N}^*$, extrinsic values. The number of 2-input multiplexers required for the previously mentioned interconnection networks and for the butterfly network is given in Table I from [7]. These expressions are given in terms of the number of extrinsic values in the second column of Tables 1 and 2. Here we recall the observation made in Section 1 that the suitability of butterfly networks for parallel turbo decoding is proved in [7] only for QPP interleavers and only for the classical partition of the


Table 2. The hardware implementation complexity of the memory reading circuit for different interconnection networks when using ARP interleavers.

| Network structure | Number of multiplexers for LLR routing | Number of multiplexers for LLR reading | Number of full adders |
|---|---|---|---|
| Crossbar network | $N_{ext} \cdot (N_{ext} - 1)$ | $N_{ext}$ | $\frac{1}{2} N_{ext} \cdot (N_{ext} + 3)$ |
| Master-slave Batcher network | $\frac{N_{ext}}{2} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 4) - 2$ | $\frac{N_{ext}}{2} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 6) - 2$ | $\frac{N_{ext}}{4} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 12) - 1$ |
| Benes network | $N_{ext} \cdot (2 \log_2 N_{ext} - 1)$ | $N_{ext} + 2$ | $2 \cdot N_{ext} + 4$ |
| Barrel shifter network | $N_{ext} \cdot \log_2 N_{ext}$ | $N_{ext}$ | $\frac{5}{2} \cdot N_{ext}$ |
| Butterfly network | $N_{ext} \cdot \log_2 N_{ext}$ | $N_{ext}$ | $2 \cdot N_{ext}$ |

whole turbo codeword in disjoint equally sized subblocks. The suitability of butterfly networks for parallel turbo decoding is proved in [3] for QPP interleavers, and in the present paper for any degree PP interleavers and for ARP interleavers with some constraints, for many types of partitions of the whole turbo codeword, which gives much more flexibility in the turbo decoder design.

We note that, for a negligible configuration overhead of the barrel shifter network, the interleaver has to fulfill the maximally-decoupled logarithmic-ring-shift (MD-LRS) property [8]. For an interleaver described by the permutation function $\pi(x): W_N \to W_N$, this means that if

$$\pi(s \cdot M + j) = Q_s \cdot M + r_j, \text{ with } M = N/N_{ext},\ s, Q_s \in W_{N_{ext}} \text{ and } j, r_j \in W_M, \qquad (54)$$

and

$$\Delta(s) = Q_s - s \pmod{N_{ext}}, \qquad (55)$$

then

$$\Delta(s + 2^i) = \Delta(s) \pmod{2^{i+1}}, \text{ where } i \in \mathbb{N},\ 2^{i+1} \mid N_{ext}. \qquad (56)$$

This property is proved in Appendix A for PP interleavers of degree at most five. For ARP interleavers, the MD-LRS property is analyzed in Appendix B. It is shown that this property is not always fulfilled, even when the number of component LPPs, R, is a power of two. This can lead to a configuration overhead of the barrel shifter networks when using ARP interleavers.

For the generation of the physical interleaved addresses, we use the results presented in Section V.B from Rosnes [39] and in Section VII from Trifina and Tarniceriu [19]. We note that, when using butterfly networks, for the ith extrinsic value, with $i \in W_N$, the memory index is $\pi(i) \pmod{N_{ext}}$ and the interleaved address is $\lfloor (\pi(i) \pmod N)/N_{ext} \rfloor$. If we first compute the value $\pi(i) \pmod N$ on $\log_2(N)$ bits, then the memory index is obtained by retaining the last $\log_2(N_{ext})$ bits and the interleaved address is obtained by retaining the first $\log_2(N) - \log_2(N_{ext})$ bits. To compute $N_{ext}$ values $\pi(i) \pmod N$, it follows from Trifina and Tarniceriu [19] that for a PP interleaver of degree d we require d + 1 precomputed and stored constants, $d \cdot N_{ext}$ additions and $d \cdot N_{ext}$ modulo N operations. For an ARP interleaver with R component LPPs, we require R + 1 precomputed and stored constants, $N_{ext}$ additions and $N_{ext}$ modulo N operations.
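The split of an interleaved value into memory index and interleaved address described above can be sketched with bit operations. The function name `split_address` is hypothetical, and the number of routed values is assumed to be a power of two.

```python
def split_address(value, n_ext):
    """Split pi(i) (mod N) into (memory index, interleaved address).

    n_ext must be a power of two. The memory index is the last
    log2(n_ext) bits; the address is the remaining leading bits.
    """
    bits = n_ext.bit_length() - 1        # log2(n_ext)
    memory_index = value & (n_ext - 1)   # value mod n_ext
    address = value >> bits              # floor(value / n_ext)
    return memory_index, address

# Example with pi(i) = 61 and 16 memory banks.
print(split_address(61, 16))
```

For the value 61 and 16 memory banks this gives memory index 13 and interleaved address 3.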


A modulo N operation can be implemented using one 2-input multiplexer and one full adder (on $\log_2(N)$ bits). Thus, for the generation of the physical interleaved addresses, when using a PP interleaver of degree d, we require $d \cdot N_{ext}$ multiplexers and $2d \cdot N_{ext}$ full adders on $\log_2(N)$ bits. When using an ARP interleaver, we require $N_{ext}$ multiplexers and $2 \cdot N_{ext}$ full adders on $\log_2(N)$ bits.

We note that for LLR reading with crossbar networks, $\frac{1}{2} N_{ext} \cdot (N_{ext} - 1)$ additional full adders are required. Master-slave Batcher networks have a master network consisting of $\frac{N_{ext}}{4} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 4) - 1$ 2-input sorter nodes and a slave network consisting of $\frac{N_{ext}}{4} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 4) - 1$ 2 × 2 switch nodes. The master network generates the control bits for the slave network. Because a 2-input sorter node consists of two 2-input multiplexers and one full adder, for master-slave Batcher networks $\frac{N_{ext}}{2} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 4) - 2$ additional 2-input multiplexers and $\frac{N_{ext}}{4} \cdot (\log_2^2 N_{ext} - \log_2 N_{ext} + 4) - 1$ additional full adders are required for LLR reading. For the configuration of Benes networks, two additions and two modulo operations are additionally required. The values $\Delta(s)$ from Eq. (55) represent the control bits of the 2-input multiplexers in the barrel shifter networks. The $N_{ext}$ values are periodic with period $N_{ext}/2$, so only $N_{ext}/2$ values have to be computed. Therefore, for the configuration of the barrel shifter networks, when the MD-LRS property is fulfilled, $N_{ext}/2$ additional full adders are required. The number of 2-input multiplexers and the number of full adders required for LLR routing for the five considered interconnection networks, when using PP of degree d and ARP interleavers, are given in the third and the fourth columns of Tables 1 and 2, respectively.

We observe that for PP interleavers of degree d the number of 2-input multiplexers and the number of full adders for LLR routing increase with both the number of routed extrinsic values, $N_{ext}$, and the degree of the PP, d, while for ARP interleavers these numbers increase only with the number of routed extrinsic values, $N_{ext}$. This is because ARP interleavers are composed of only several LPPs and, as a consequence, they allow efficient generation of the interleaved addresses. Therefore, the ARP representation of PPs, as in [32] for QPPs and as in [19] for CPPs, or the PLPP representation as in [38], is much more efficient for implementation if the number of component LPPs can be a power of two.
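The expressions of Tables 1 and 2 can be evaluated for concrete values of N_ext and d. The sketch below transcribes the expressions as printed in Table 1; the function name `complexity` is hypothetical. Comparing the two tables term by term suggests that setting d = 1 reproduces the ARP expressions of Table 2, which is how the ARP case is handled here.

```python
def complexity(n_ext, d):
    """Return {network: (routing muxes, reading muxes, full adders)}.

    n_ext is a power of two; d is the PP degree (d = 1 gives the
    ARP expressions of Table 2).
    """
    lg = n_ext.bit_length() - 1          # log2(n_ext)
    b = lg * lg - lg                     # log2^2(n_ext) - log2(n_ext)
    return {
        "crossbar": (n_ext * (n_ext - 1), d * n_ext,
                     n_ext * (n_ext + 4 * d - 1) // 2),
        "batcher": (n_ext * (b + 4) // 2 - 2,
                    n_ext * (b + 2 * d + 4) // 2 - 2,
                    n_ext * (b + 8 * d + 4) // 4 - 1),
        "benes": (n_ext * (2 * lg - 1), d * n_ext + 2,
                  2 * d * n_ext + 4),
        "barrel": (n_ext * lg, d * n_ext, n_ext * (4 * d + 1) // 2),
        "butterfly": (n_ext * lg, d * n_ext, 2 * d * n_ext),
    }

pp5 = complexity(16, 5)    # 5-PP interleaver, 16 routed values
arp = complexity(16, 1)    # ARP interleaver, 16 routed values
print(pp5["butterfly"], arp["butterfly"])
```

For N_ext = 16 the butterfly entries evaluate to (64, 80, 160) for d = 5 and (64, 16, 32) for the ARP case, and the barrel shifter differs from the butterfly network only by 8 full adders in both cases.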

5.2. Examples

In this section we exemplify the determination of the control bits of a 16 × 16 butterfly network for a 5-PP and an ARP interleaver. We consider the fifth degree PP (5-PP) interleaver

$$\pi(x) = 65x + 38x^2 + 16x^3 + 10x^4 + 12x^5 \pmod{112} \qquad (57)$$

of length N = 112, found by the method presented in [40]. This 5-PP interleaver leads to the same optimum minimum distance as that given in [20] when using LTE turbo codes [41], but offers better performance, since it is optimized with respect to the first nine terms of the distance spectrum and not only the first term.
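That Eq. (57) defines a permutation of W_112 can be confirmed by direct enumeration. A minimal sketch follows; the function name `pi_5pp` is ours.

```python
def pi_5pp(x, N=112):
    """5-PP of Eq. (57)."""
    return (65 * x + 38 * x**2 + 16 * x**3 + 10 * x**4 + 12 * x**5) % N

values = [pi_5pp(x) for x in range(112)]
is_permutation = sorted(values) == list(range(112))

# Reduction modulo 16 used later for the memory indices:
# 65 = 1, 38 = 6, 16 = 0 (mod 16), so the cubic term vanishes.
reduced_ok = all(
    pi_5pp(x) % 16 == (x + 6 * x**2 + 10 * x**4 + 12 * x**5) % 16
    for x in range(112)
)
print(is_permutation, reduced_ok)
```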


Table 3. FER performances on AWGN channel at SNR = 4 dB for 5-PP, ARP and LTE-QPP interleavers of length 112.

| Interleaver | FER × 10^7 |
|---|---|
| $65x + 38x^2 + 16x^3 + 10x^4 + 12x^5 \pmod{112}$ | 4.336 |
| $15x \pmod{112}$ if $x = 0 \pmod 2$; $15x + 54 \pmod{112}$ if $x = 1 \pmod 2$ | 3.567 |
| $41x + 84x^2 \pmod{112}$ (LTE-QPP) | 8.782 |

As an example for ARP interleavers, we consider the following ARP with two component LPPs:

$$\pi(x) = \begin{cases} 15x \pmod{112}, & \text{if } x = 0 \pmod 2, \\ 15x + 54 \pmod{112}, & \text{if } x = 1 \pmod 2. \end{cases} \qquad (58)$$

This ARP interleaver is given in [21] as a dithered LPP interleaver with the coefficients of the linear terms equal to each other. The ARP parameters are P = 15, $P_0 = 0$, and $P_1 = 54$, and they fulfill condition (34) for i = 0, j = 1, and R = 2.

The performances of the above 5-PP and ARP (PLPP) interleavers and of the LTE interleaver of length 112 [41] are shown in Table 3 in terms of frame error rate (FER) at high signal to noise ratio (SNR). An additive white Gaussian noise (AWGN) channel was assumed and the MAP algorithm was used in turbo decoding. From Table 3 we see that the 5-PP interleaver leads to better FER performance than the LTE-QPP interleaver, and the ARP interleaver is better than the 5-PP interleaver.

We note that when using several pipelined units to process the trellis of the component convolutional codes of the turbo code, a tradeoff must be made between the maximum allowed frequency and the critical path delay in the implemented circuit. Using more pipelined units increases the critical path delay and, as a consequence, the maximum allowed frequency is smaller.

For the interleaver length N = 112, the maximum degree of parallelism when using butterfly networks is $L = 2^4 = 16$. Firstly, we consider the address vectors as in the first example from Nieminen [3], i.e.

$$a_i(k) = 16k + i, \quad \forall i \in W_{16} \text{ and } \forall k \in W_7. \qquad (59)$$

We note that these address vectors are valid for using eight pipelined radix-4 units which continuously process the whole turbo block of length 112. In the proof given in [9] the only address vectors, for L = 16, are $a_i(k) = 7 \cdot i + k$, $\forall i \in W_{16}$ and $\forall k \in W_7$. Thus, the trellis processing must be done on 16 disjoint subblocks of length 7.

For the address vectors considered above and for the 5-PP interleaver from Eq. (57), the interleaved addresses are mapped to the memories $\pi(i) = i + 6i^2 + 10i^4 + 12i^5 \pmod{16}$, $\forall i \in W_{16}$. The interleaved address vectors $\pi(a_i(k))$ are given in Table 4 and the physical interleaved address vectors are given in Table 5. The row π(i) in Table 5 shows the index of the accessed memory for the physical interleaved addresses. The last row, “cb”, gives the control bits for the 16 × 16 butterfly network. We observe that these control bits follow Eq. (29) for n = 4. For the same address vectors and for the ARP interleaver from Eq. (58), the interleaved address vectors $\pi(a_i(k))$ are given in Table 6 and the physical interleaved address vectors are given in Table 7.

If the critical path delay is too large for eight radix-4 units, we can consider two subblocks, with each subblock processed by four pipelined radix-4 units. If we consider a XMAP


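Table 4. The interleaved addresses π(16k + i), i = 0, 1, ..., 15, for the 5-PP interleaver $\pi(x) = 65x + 38x^2 + 16x^3 + 10x^4 + 12x^5 \pmod{112}$.

| k \ i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 29 | 58 | 103 | 52 | 97 | 46 | 91 | 8 | 37 | 82 | 31 | 76 | 25 | 70 | 99 |
| 1 | 16 | 61 | 10 | 55 | 4 | 49 | 78 | 107 | 40 | 101 | 34 | 95 | 28 | 57 | 86 | 19 |
| 2 | 80 | 13 | 74 | 7 | 36 | 65 | 110 | 59 | 104 | 53 | 98 | 15 | 44 | 89 | 38 | 83 |
| 3 | 32 | 77 | 106 | 23 | 68 | 17 | 62 | 11 | 56 | 85 | 2 | 47 | 108 | 41 | 102 | 35 |
| 4 | 64 | 93 | 26 | 87 | 20 | 81 | 14 | 43 | 72 | 5 | 66 | 111 | 60 | 105 | 22 | 51 |
| 5 | 96 | 45 | 90 | 39 | 84 | 1 | 30 | 75 | 24 | 69 | 18 | 63 | 92 | 9 | 54 | 3 |
| 6 | 48 | 109 | 42 | 71 | 100 | 33 | 94 | 27 | 88 | 21 | 50 | 79 | 12 | 73 | 6 | 67 |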

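Table 5. The physical interleaved addresses ⌊π(16k + i)/16⌋, i = 0, 1, ..., 15, for the 5-PP interleaver $\pi(x) = 65x + 38x^2 + 16x^3 + 10x^4 + 12x^5 \pmod{112}$.

| k \ i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 3 | 6 | 3 | 6 | 2 | 5 | 0 | 2 | 5 | 1 | 4 | 1 | 4 | 6 |
| 1 | 1 | 3 | 0 | 3 | 0 | 3 | 4 | 6 | 2 | 6 | 2 | 5 | 1 | 3 | 5 | 1 |
| 2 | 5 | 0 | 4 | 0 | 2 | 4 | 6 | 3 | 6 | 3 | 6 | 0 | 2 | 5 | 2 | 5 |
| 3 | 2 | 4 | 6 | 1 | 4 | 1 | 3 | 0 | 3 | 5 | 0 | 2 | 6 | 2 | 6 | 2 |
| 4 | 4 | 5 | 1 | 5 | 1 | 5 | 0 | 2 | 4 | 0 | 4 | 6 | 3 | 6 | 1 | 3 |
| 5 | 6 | 2 | 5 | 2 | 5 | 0 | 1 | 4 | 1 | 4 | 1 | 3 | 5 | 0 | 3 | 0 |
| 6 | 3 | 6 | 2 | 4 | 6 | 2 | 5 | 1 | 5 | 1 | 3 | 4 | 0 | 4 | 0 | 4 |
| π(i) | 0 | 13 | 10 | 7 | 4 | 1 | 14 | 11 | 8 | 5 | 2 | 15 | 12 | 9 | 6 | 3 |
| cb | 0 | 12 | 8 | 4 | 0 | 4 | 8 | 12 | 0 | 12 | 8 | 4 | 0 | 4 | 8 | 12 |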
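The entries of Tables 4 and 5 can be regenerated from Eq. (57) and the address vectors of Eq. (59). The Python sketch below uses names of our own choosing. The control-bit row is computed as the bitwise XOR of the pin index i and the memory index π(i) mod 16, which is an observation that matches the tabulated “cb” values and is not a restatement of Eq. (29).

```python
def pi_5pp(x, N=112):
    """5-PP of Eq. (57)."""
    return (65 * x + 38 * x**2 + 16 * x**3 + 10 * x**4 + 12 * x**5) % N

L_PAR, K = 16, 7   # degree of parallelism and subblock length

# Table 4: interleaved addresses pi(16k + i), one row per k.
interleaved = [[pi_5pp(16 * k + i) for i in range(L_PAR)] for k in range(K)]

# Table 5: physical addresses floor(pi(16k + i) / 16).
physical = [[v // L_PAR for v in row] for row in interleaved]

# Memory index per pin; it is the same for every k because 16 | 112.
memory = [pi_5pp(i) % L_PAR for i in range(L_PAR)]

# Assumption: control bits taken as pin index XOR memory index.
cb = [i ^ memory[i] for i in range(L_PAR)]

print(interleaved[1])
print(physical[1])
print(memory)
print(cb)
```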

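Table 6. The interleaved addresses π(16k + i), i = 0, 1, ..., 15, for the ARP interleaver in Eq. (58).

| k \ i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 29 | 58 | 103 | 52 | 97 | 46 | 91 | 8 | 37 | 82 | 31 | 76 | 25 | 70 | 99 |
| 1 | 16 | 61 | 10 | 55 | 4 | 49 | 78 | 107 | 40 | 101 | 34 | 95 | 28 | 57 | 86 | 19 |
| 2 | 80 | 13 | 74 | 7 | 36 | 65 | 110 | 59 | 104 | 53 | 98 | 15 | 44 | 89 | 38 | 83 |
| 3 | 32 | 77 | 106 | 23 | 68 | 17 | 62 | 11 | 56 | 85 | 2 | 47 | 108 | 41 | 102 | 35 |
| 4 | 64 | 93 | 26 | 87 | 20 | 81 | 14 | 43 | 72 | 5 | 66 | 111 | 60 | 105 | 22 | 51 |
| 5 | 96 | 45 | 90 | 39 | 84 | 1 | 30 | 75 | 24 | 69 | 18 | 63 | 92 | 9 | 54 | 3 |
| 6 | 48 | 109 | 42 | 71 | 100 | 33 | 94 | 27 | 88 | 21 | 50 | 79 | 12 | 73 | 6 | 67 |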
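The interleaved addresses of the ARP in Eq. (58) for the address vectors of Eq. (59) follow directly from the two component LPPs, so they can be recomputed as a consistency check. A minimal sketch, with a function name of our own:

```python
def pi_arp(x, N=112):
    """ARP of Eq. (58): P = 15, P0 = 0, P1 = 54."""
    return (15 * x + (54 if x % 2 else 0)) % N

# One row per k in W_7, one column per i in W_16, as in Eq. (59).
rows = [[pi_arp(16 * k + i) for i in range(16)] for k in range(7)]
for k, row in enumerate(rows):
    print(k, row)
```

Because 15 · 16 = 240 = 16 (mod 112), each entry reduces to π(16k + i) = 16k + π(i) (mod 112). For example, π(1) = 69 and π(17) = 85.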

instead of a SMAP component decoder then, for L = 16, we require only two pipelined radix-4 units for each subblock. In this case we can use the address vectors

$$\begin{cases} a_i(k) = 4 \cdot (6 - k) + i, \\ a_{4+i}(k) = 4 \cdot (7 + k) + i, \\ a_{8+i}(k) = 4 \cdot (20 - k) + i, \\ a_{12+i}(k) = 4 \cdot (21 + k) + i, \end{cases} \quad \forall i \in W_4 \text{ and } \forall k \in W_7. \qquad (60)$$

These vectors are uniformly 16-dyadic and we can use a 16 × 16 butterfly network to permute the extrinsic values computed by the component decoders. Trellis processing will be


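Table 7. The physical interleaved addresses ⌊π(16k + i)/16⌋, i = 0, 1, ..., 15, for the ARP interleaver in Eq. (58).

| k \ i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 3 | 6 | 3 | 6 | 2 | 5 | 0 | 2 | 5 | 1 | 4 | 1 | 4 | 6 |
| 1 | 1 | 3 | 0 | 3 | 0 | 3 | 4 | 6 | 2 | 6 | 2 | 5 | 1 | 3 | 5 | 1 |
| 2 | 5 | 0 | 4 | 0 | 2 | 4 | 6 | 3 | 6 | 3 | 6 | 0 | 2 | 5 | 2 | 5 |
| 3 | 2 | 4 | 6 | 1 | 4 | 1 | 3 | 0 | 3 | 5 | 0 | 2 | 6 | 2 | 6 | 2 |
| 4 | 4 | 5 | 1 | 5 | 1 | 5 | 0 | 2 | 4 | 0 | 4 | 6 | 3 | 6 | 1 | 3 |
| 5 | 6 | 2 | 5 | 2 | 5 | 0 | 1 | 4 | 1 | 4 | 1 | 3 | 5 | 0 | 3 | 0 |
| 6 | 3 | 6 | 2 | 4 | 6 | 2 | 5 | 1 | 5 | 1 | 3 | 4 | 0 | 4 | 0 | 4 |
| π(i) | 0 | 13 | 10 | 7 | 4 | 1 | 14 | 11 | 8 | 5 | 2 | 15 | 12 | 9 | 6 | 3 |
| cb | 0 | 12 | 8 | 4 | 0 | 4 | 8 | 12 | 0 | 12 | 8 | 4 | 0 | 4 | 8 | 12 |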

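Table 8. The address vectors from Eq. (60). Pins 0 to 3 correspond to 4·(6 − k) + i, pins 4 to 7 to 4·(7 + k) + i, pins 8 to 11 to 4·(20 − k) + i, and pins 12 to 15 to 4·(21 + k) + i, with i = 0, 1, 2, 3 within each group.

| k \ pin | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 |
| 1 | 20 | 21 | 22 | 23 | 32 | 33 | 34 | 35 | 76 | 77 | 78 | 79 | 88 | 89 | 90 | 91 |
| 2 | 16 | 17 | 18 | 19 | 36 | 37 | 38 | 39 | 72 | 73 | 74 | 75 | 92 | 93 | 94 | 95 |
| 3 | 12 | 13 | 14 | 15 | 40 | 41 | 42 | 43 | 68 | 69 | 70 | 71 | 96 | 97 | 98 | 99 |
| 4 | 8 | 9 | 10 | 11 | 44 | 45 | 46 | 47 | 64 | 65 | 66 | 67 | 100 | 101 | 102 | 103 |
| 5 | 4 | 5 | 6 | 7 | 48 | 49 | 50 | 51 | 60 | 61 | 62 | 63 | 104 | 105 | 106 | 107 |
| 6 | 0 | 1 | 2 | 3 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 108 | 109 | 110 | 111 |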
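The address vectors of Eq. (60) can be generated programmatically, which also makes it easy to confirm that together they cover every address of the block of length 112 exactly once. A minimal sketch; the function name and the encoding of each group as a (base, sign) pair are ours.

```python
def address_vectors():
    """Rows indexed by k in W_7, columns by pin 0..15, following Eq. (60)."""
    rows = []
    for k in range(7):
        row = []
        # (base, sign) encodes 4*(base + sign*k) + i for the four groups.
        for base, sign in ((6, -1), (7, 1), (20, -1), (21, 1)):
            row += [4 * (base + sign * k) + i for i in range(4)]
        rows.append(row)
    return rows

table8 = address_vectors()
covered = sorted(a for row in table8 for a in row)

print(table8[0])
print(covered == list(range(112)))
```

Two groups run backwards in k and two run forwards, so each half of the block is read from its middle outwards.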

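Table 9. The interleaved addresses for the 5-PP interleaver $\pi(x) = 65x + 38x^2 + 16x^3 + 10x^4 + 12x^5 \pmod{112}$ when the address vectors are from Eq. (60). Pins 0 to 3 give π(4·(6 − k) + i), pins 4 to 7 give π(4·(7 + k) + i), pins 8 to 11 give π(4·(20 − k) + i), and pins 12 to 15 give π(4·(21 + k) + i), with i = 0, 1, 2, 3 within each group.

| k \ pin | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 101 | 34 | 95 | 28 | 57 | 86 | 19 | 96 | 45 | 90 | 39 | 84 | 1 | 30 | 75 |
| 1 | 4 | 49 | 78 | 107 | 80 | 13 | 74 | 7 | 60 | 105 | 22 | 51 | 24 | 69 | 18 | 63 |
| 2 | 16 | 61 | 10 | 55 | 36 | 65 | 110 | 59 | 72 | 5 | 66 | 111 | 92 | 9 | 54 | 3 |
| 3 | 76 | 25 | 70 | 99 | 104 | 53 | 98 | 15 | 20 | 81 | 14 | 43 | 48 | 109 | 42 | 71 |
| 4 | 8 | 37 | 82 | 31 | 44 | 89 | 38 | 83 | 64 | 93 | 26 | 87 | 100 | 33 | 94 | 27 |
| 5 | 52 | 97 | 46 | 91 | 32 | 77 | 106 | 23 | 108 | 41 | 102 | 35 | 88 | 21 | 50 | 79 |
| 6 | 0 | 29 | 58 | 103 | 68 | 17 | 62 | 11 | 56 | 85 | 2 | 47 | 12 | 73 | 6 | 67 |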

continuously in each of the two subblocks. We note that with the address vectors from Takeshita [9] the trellis processing must be done over four disjoint subblocks, because a radix-4 XMAP decoder processes four bits at the same time. The considered address vectors from Eq. (60) are given in Table 8. The interleaved address vectors for the 5-PP interleaver from Eq. (57) and for the ARP interleaver from Eq. (58) are given in Tables 9 and 10, respectively. In Tables 11 and 12, in the first


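Table 10. The interleaved addresses for the ARP interleaver in Eq. (58) when the address vectors are from Eq. (60). Pins 0 to 3 give π(4·(6 − k) + i), pins 4 to 7 give π(4·(7 + k) + i), pins 8 to 11 give π(4·(20 − k) + i), and pins 12 to 15 give π(4·(21 + k) + i), with i = 0, 1, 2, 3 within each group.

| k \ pin | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 93 | 54 | 11 | 84 | 41 | 2 | 71 | 80 | 37 | 110 | 67 | 28 | 97 | 58 | 15 |
| 1 | 76 | 33 | 106 | 63 | 32 | 101 | 62 | 19 | 20 | 89 | 50 | 7 | 88 | 45 | 6 | 75 |
| 2 | 16 | 85 | 46 | 3 | 92 | 49 | 10 | 79 | 72 | 29 | 102 | 59 | 36 | 105 | 66 | 23 |
| 3 | 68 | 25 | 98 | 55 | 40 | 109 | 70 | 27 | 12 | 81 | 42 | 111 | 96 | 53 | 14 | 83 |
| 4 | 8 | 77 | 38 | 107 | 100 | 57 | 18 | 87 | 64 | 21 | 94 | 51 | 44 | 1 | 74 | 31 |
| 5 | 60 | 17 | 90 | 47 | 48 | 5 | 78 | 35 | 4 | 73 | 34 | 103 | 104 | 61 | 22 | 91 |
| 6 | 0 | 69 | 30 | 99 | 108 | 65 | 26 | 95 | 56 | 13 | 86 | 43 | 52 | 9 | 82 | 39 |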

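Table 11. Input and output pins for the 5-PP interleaver $\pi(x) = 65x + 38x^2 + 16x^3 + 10x^4 + 12x^5 \pmod{112}$ when the address vectors are from Eq. (60). Each entry is o/cb (output pin / control bits).

| k \ pin | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8/8 | 9/8 | 10/8 | 11/8 | 12/8 | 13/8 | 14/8 | 15/8 | 0/8 | 1/8 | 2/8 | 3/8 | 4/8 | 5/8 | 6/8 | 7/8 |
| 1 | 4/4 | 5/4 | 6/4 | 7/4 | 0/4 | 1/4 | 2/4 | 3/4 | 12/4 | 13/4 | 14/4 | 15/4 | 8/4 | 9/4 | 10/4 | 11/8 |
| 2 | 0/0 | 1/0 | 2/0 | 3/0 | 4/0 | 5/0 | 6/0 | 7/0 | 8/0 | 9/0 | 10/0 | 11/0 | 12/0 | 13/0 | 14/0 | 15/0 |
| 3 | 12/12 | 13/12 | 14/12 | 15/12 | 8/12 | 9/12 | 10/12 | 11/12 | 4/12 | 5/12 | 6/12 | 7/12 | 0/12 | 1/12 | 2/12 | 3/12 |
| 4 | 8/8 | 9/8 | 10/8 | 11/8 | 12/8 | 13/8 | 14/8 | 15/8 | 0/8 | 1/8 | 2/8 | 3/8 | 4/8 | 5/8 | 6/8 | 7/8 |
| 5 | 4/4 | 5/4 | 6/4 | 7/4 | 0/4 | 1/4 | 2/4 | 3/4 | 12/4 | 13/4 | 14/4 | 15/4 | 8/4 | 9/4 | 10/4 | 11/8 |
| 6 | 0/0 | 1/0 | 2/0 | 3/0 | 4/0 | 5/0 | 6/0 | 7/0 | 8/0 | 9/0 | 10/0 | 11/0 | 12/0 | 13/0 | 14/0 | 15/0 |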

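Table 12. Input and output pins for the ARP interleaver in Eq. (58) when the address vectors are from Eq. (60). Each entry is o/cb (output pin / control bits).

| k \ pin | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8/8 | 13/12 | 6/4 | 11/8 | 4/0 | 9/12 | 2/4 | 7/0 | 0/8 | 5/12 | 14/4 | 3/8 | 12/0 | 1/12 | 10/4 | 15/0 |
| 1 | 12/12 | 1/0 | 10/8 | 15/12 | 0/4 | 5/0 | 14/8 | 3/4 | 4/12 | 9/0 | 2/8 | 7/12 | 8/4 | 13/0 | 6/8 | 11/4 |
| 2 | 0/0 | 5/4 | 14/12 | 3/0 | 12/8 | 1/4 | 10/12 | 15/8 | 8/0 | 13/4 | 6/12 | 11/0 | 4/8 | 9/4 | 2/12 | 7/8 |
| 3 | 4/4 | 9/8 | 2/0 | 7/4 | 8/12 | 13/8 | 6/0 | 11/12 | 12/4 | 1/8 | 10/0 | 15/4 | 0/12 | 5/8 | 14/0 | 3/12 |
| 4 | 8/8 | 13/12 | 6/4 | 11/8 | 4/0 | 9/12 | 2/4 | 7/0 | 0/8 | 5/12 | 14/4 | 3/8 | 12/0 | 1/12 | 10/4 | 15/0 |
| 5 | 12/12 | 1/0 | 10/8 | 15/12 | 0/4 | 5/0 | 14/8 | 3/4 | 4/12 | 9/0 | 2/8 | 7/12 | 8/4 | 13/0 | 6/8 | 11/4 |
| 6 | 0/0 | 5/4 | 14/12 | 3/0 | 12/8 | 1/4 | 10/12 | 15/8 | 8/0 | 13/4 | 6/12 | 11/0 | 4/8 | 9/4 | 2/12 | 7/8 |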

row the input pins are given, and in the other rows the values o/cb represent the output pin and the control bits, respectively, for every $k \in W_7$ and for every input pin in $W_{16}$.

In Tables 13 and 14 we give the hardware implementation complexity of the memory reading circuit for the two considered interleavers and for the five interconnection networks analyzed in Section 5.1. We see that butterfly networks are the most efficient in terms of the number of 2-input multiplexers and the number of full adders. Barrel shifter networks have similar complexity, with only 8 additional full adders required. However, the major advantage of butterfly networks is the much greater flexibility they allow for routing the extrinsic values.


Table 13. The hardware implementation complexity of the memory reading circuit for different interconnection networks when using the 5-PP interleaver π(x) = 65x + 38x^2 + 16x^3 + 10x^4 + 12x^5 (mod 112).

Network structure               Multiplexers         Multiplexers         Full adders
                                (for LLR routing)    (for LLR reading)
Crossbar network                240                  80                   280
Master-slave Batcher network    126                  206                  223
Benes network                   496                  82                   164
Barrel shifter network          64                   80                   168
Butterfly network               64                   80                   160
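The interleaver in the caption of Table 13 can be checked directly. The following sketch assumes 16 processors with window length M = 7 (consistent with k ∈ W7 and pin ∈ W16 above) and takes the memory bank of an address a to be ⌊a/M⌋. Under these assumptions it verifies that π is a permutation of {0, ..., 111} and that the addresses accessed in parallel always fall in distinct banks. Function names are illustrative.

```python
N = 112                         # interleaver length from the Table 13 caption
COEFFS = [65, 38, 16, 10, 12]   # q1..q5 of the 5-PP


def pi(x, coeffs=COEFFS, n=N):
    """Evaluate pi(x) = sum_{k=1..d} q_k * x^k (mod n)."""
    return sum(q * pow(x, k, n) for k, q in enumerate(coeffs, start=1)) % n


def is_permutation(n=N):
    """True if pi maps {0..n-1} one-to-one onto itself."""
    return sorted(pi(x) for x in range(n)) == list(range(n))


def is_contention_free(num_banks, n=N):
    """Assumed setup: windows of length M = n // num_banks, bank = address // M.

    For each in-window offset j, the processors s = 0..num_banks-1 access the
    interleaved addresses pi(s*M + j); all of them must hit different banks.
    """
    m = n // num_banks
    for j in range(m):
        banks = {pi(s * m + j) // m for s in range(num_banks)}
        if len(banks) != num_banks:
            return False
    return True


print(is_permutation(), [is_contention_free(b) for b in (2, 4, 8, 16)])
```

The loop over `(2, 4, 8, 16)` mirrors the statement that any power of two dividing the interleaver length can serve as the number of processors.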

Table 14. The hardware implementation complexity of the memory reading circuit for different interconnection networks when using the ARP interleaver in Eq. (58).

Network structure               Multiplexers         Multiplexers         Full adders
                                (for LLR routing)    (for LLR reading)
Crossbar network                240                  16                   152
Master-slave Batcher network    126                  142                  95
Benes network                   496                  18                   36
Barrel shifter network          64                   16                   40
Butterfly network               64                   16                   32

6. Conclusions

In this paper we prove that parallel decoding of turbo codes with PP interleavers of any degree and with ARP interleavers can be performed using butterfly networks, with the same easy computation of the control bits as was shown for QPP interleavers in [3]. The usefulness of these results lies in the possibility of computing the control bits "on the fly" when using turbo codes with these high-performance algebraic interleavers.

The result for PP interleavers of any degree is general, in the sense that any power of two dividing the interleaver length can be used as the number of processors in parallel turbo decoding. The result for ARP interleavers is slightly less general, since the ARP was assumed to consist of a number R of LPPs equal to a power of two. For any valid set of parameters of such an ARP, the number of processors used in parallel turbo decoding can be any power of two that divides the interleaver length and is greater than or equal to the number of component LPPs. If the ARP parameters fulfill some additional constraints (see Eq. (48)), then the number of processors can be smaller than the number of component LPPs. We have shown that condition (48) is always fulfilled if the free terms of the component LPPs of the ARP are multiples of R plus a constant, as is the case for the ARP interleavers proposed in [25] as an alternative for the LTE standard. The same observation applies to some of the ARP interleavers used in the DVB-RCS [23] and WiMax [24] standards. We have shown that, for the remaining interleavers from these standards, the number of processors can be smaller than the number of component LPPs because their parameters fulfill the conditions in Lemma 4.5. Refs. [42,43] provide the parameters of good ARP interleavers with R = 4 for the length of 752. We note that, for the best ARP interleavers reported in [42,43], the number of processors can also be equal to two.

We have also compared the implementation complexity of butterfly networks with that of four other known interconnection networks used for parallel turbo decoding. We have shown that butterfly networks and barrel shifter networks achieve the lowest complexity in terms of the number of 2-input multiplexers and the number of full adders required for implementation. However, butterfly networks allow much more flexibility in the design of the extrinsic value routing. Additionally, ARP interleavers allow a lower complexity for the generation of the interleaved addresses compared to PP interleavers of degree two or higher. Therefore, the ARP [19,32] or PLPP [38] representation of PP interleavers would be preferable for implementation.

Acknowledgement

We thank the editor and the reviewers for their helpful comments and suggestions, which greatly improved the quality and the presentation of this paper.

Appendix A. MD-LRS property for PP interleavers of degree d (d ≤ 5)

For π(x) as in (4) and x = s · M + j as in Eq. (18), we similarly have

$$
\begin{aligned}
\pi(s \cdot M + j) &= \pi(s \cdot M) + q_1 \cdot j + \sum_{k=2}^{d} q_k \cdot \sum_{l=0}^{k-1} C_k^l \cdot (s \cdot M)^l \cdot j^{k-l} \pmod{N} \\
&= \pi(s \cdot M) + \pi(j) + M \cdot \sum_{k=2}^{d} q_k \cdot \sum_{l=1}^{k-1} C_k^l \cdot s^l \cdot M^{l-1} \cdot j^{k-l} \pmod{N} \\
&= Q_s \cdot M + r_j \pmod{N},
\end{aligned}
\tag{61}
$$

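where

$$
r_j = \pi(j) - \theta \cdot M, \quad \text{with } \theta = \left\lfloor \frac{\pi(j) \pmod{N}}{M} \right\rfloor,
\tag{62}
$$

and

$$
Q_s = \sum_{k=1}^{d} q_k \cdot s^k \cdot M^{k-1} + \sum_{k=2}^{d} q_k \cdot \sum_{l=1}^{k-1} C_k^l \cdot s^l \cdot M^{l-1} \cdot j^{k-l} + \theta \pmod{N_{ext}}.
\tag{63}
$$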

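Then $\Delta(s)$ from Eq. (55) results as

$$
\begin{aligned}
\Delta(s) &= Q_s - s \pmod{N_{ext}} \\
&= (q_1 - 1) \cdot s + \sum_{k=2}^{d} q_k \cdot s^k \cdot M^{k-1} + \sum_{k=2}^{d} q_k \cdot \sum_{l=1}^{k-1} C_k^l \cdot s^l \cdot M^{l-1} \cdot j^{k-l} + \theta \pmod{N_{ext}}.
\end{aligned}
\tag{64}
$$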


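$\Delta(s + 2^i)$ results as

$$
\begin{aligned}
\Delta(s + 2^i) &= Q_{s+2^i} - (s + 2^i) \pmod{N_{ext}} \\
&= (q_1 - 1) \cdot (s + 2^i) + \sum_{k=2}^{d} q_k \cdot (s + 2^i)^k \cdot M^{k-1} \\
&\quad + \sum_{k=2}^{d} q_k \cdot \sum_{l=1}^{k-1} C_k^l \cdot (s + 2^i)^l \cdot M^{l-1} \cdot j^{k-l} + \theta \pmod{N_{ext}}.
\end{aligned}
\tag{65}
$$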

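After some algebraic manipulations, we obtain

$$
\begin{aligned}
\Delta(s + 2^i) = \Delta(s) &+ (q_1 - 1) \cdot 2^i + \sum_{k=2}^{d} q_k \cdot M^{k-1} \cdot \sum_{l=0}^{k-1} C_k^l \cdot s^l \cdot 2^{i \cdot (k-l)} \\
&+ \sum_{k=2}^{d} q_k \cdot \sum_{l=1}^{k-1} C_k^l \cdot M^{l-1} \cdot j^{k-l} \cdot \sum_{t=0}^{l-1} C_l^t \cdot s^t \cdot 2^{i \cdot (l-t)} \pmod{N_{ext}}.
\end{aligned}
\tag{66}
$$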

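When 4|N, as for the interleaver lengths in the LTE standard, the PP coefficients have to fulfill the conditions from Theorem 3.4. Because $q_1 = 1 \pmod{2}$, we have $(q_1 - 1) \cdot 2^i \pmod{2^{i+1}} = 0$. Now we have

$$
\begin{aligned}
&\sum_{k=2}^{d} q_k \cdot M^{k-1} \cdot \sum_{l=0}^{k-1} C_k^l \cdot s^l \cdot 2^{i \cdot (k-l)} \pmod{2^{i+1}} \\
&= \underbrace{\sum_{k=2}^{d} q_k \cdot M^{k-1} \cdot \sum_{l=0}^{k-2} C_k^l \cdot s^l \cdot 2^{i \cdot (k-l)}}_{=0 \pmod{2^{i+1}}} + \sum_{k=2}^{d} q_k \cdot (sM)^{k-1} \cdot k \cdot 2^i \pmod{2^{i+1}} \\
&= \underbrace{2^i \cdot \left(2 \cdot q_2 \cdot sM + 2 \cdot q_3 \cdot (sM)^2 + 4 \cdot q_4 \cdot (sM)^3 + 4 \cdot q_5 \cdot (sM)^4 + \cdots\right)}_{=0 \pmod{2^{i+1}}} \\
&\quad + \underbrace{2^i \cdot \left(q_3 \cdot (sM)^2 + q_5 \cdot (sM)^4 + q_7 \cdot (sM)^6 + \cdots\right)}_{=0 \pmod{2^{i+1}}} = 0 \pmod{2^{i+1}}.
\end{aligned}
\tag{67}
$$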

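We note that if $sM$ is even then $(sM)^k$ with $k \in \mathbb{N}$ is also even, and if $sM$ is odd then $(sM)^k$ is also odd. Then, the last result from Eq. (67) is true because from Theorem 3.4 we have $b \cdot (q_3 + q_5 + q_7 + \cdots) \pmod{2} = 0$, where $b = (sM)^k \pmod{2}$, $k \in \mathbb{N}$. For the last sum in Eq. (66) we have

$$
\begin{aligned}
&\sum_{k=2}^{d} q_k \cdot \sum_{l=1}^{k-1} C_k^l \cdot M^{l-1} \cdot j^{k-l} \cdot \sum_{t=0}^{l-1} C_l^t \cdot s^t \cdot 2^{i \cdot (l-t)} \pmod{2^{i+1}} \\
&= \underbrace{\sum_{k=2}^{d} q_k \cdot k \cdot j^{k-1} \cdot 2^i}_{=0 \pmod{2^{i+1}} \text{ (as in (67))}} + \sum_{k=3}^{d} q_k \cdot \sum_{l=2}^{k-1} C_k^l \cdot M^{l-1} \cdot j^{k-l} \cdot \sum_{t=0}^{l-1} C_l^t \cdot s^t \cdot 2^{i \cdot (l-t)} \pmod{2^{i+1}} \\
&= \underbrace{\sum_{k=3}^{d} q_k \cdot \sum_{l=2}^{k-1} C_k^l \cdot M^{l-1} \cdot j^{k-l} \cdot \sum_{t=0}^{l-2} C_l^t \cdot s^t \cdot 2^{i \cdot (l-t)}}_{=0 \pmod{2^{i+1}}} + 2^i \cdot \sum_{k=3}^{d} q_k \cdot \sum_{l=2}^{k-1} C_k^l \cdot (sM)^{l-1} \cdot j^{k-l} \pmod{2^{i+1}}.
\end{aligned}
\tag{68}
$$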


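The parity of the quantity $\sum_{l=2}^{k-1} C_k^l \cdot (sM)^{l-1} \cdot j^{k-l}$ is difficult to assess for a general k. Therefore we restrict ourselves to PP interleavers of degree at most five. To our knowledge, no PP interleavers of degree greater than 5 have so far been studied for turbo codes (see [20] for the most recent results in this sense). For d = 5 we have

$$
\begin{aligned}
&2^i \cdot \sum_{k=3}^{d} q_k \cdot \sum_{l=2}^{k-1} C_k^l \cdot (sM)^{l-1} \cdot j^{k-l} \pmod{2^{i+1}} \\
&= 2^i \cdot \Big( 3 \cdot q_3 \cdot (sM) \cdot j + q_4 \cdot \underbrace{\left( C_4^2 \cdot (sM) \cdot j^2 + 4 \cdot (sM)^2 \cdot j \right)}_{=0 \pmod{2}} \\
&\qquad + q_5 \cdot \big( \underbrace{C_5^2 \cdot (sM) \cdot j^3 + C_5^3 \cdot (sM)^2 \cdot j^2}_{=0 \pmod{2}} + 5 \cdot (sM)^3 \cdot j \big) \Big) \pmod{2^{i+1}} \\
&= 2^i \cdot \left( q_3 \cdot (sM) \cdot j + q_5 \cdot (sM)^3 \cdot j \right) \pmod{2^{i+1}} \\
&= 2^i \cdot (sM) \cdot j \cdot \left( q_3 + q_5 \cdot (sM)^2 \right) \pmod{2^{i+1}}.
\end{aligned}
\tag{69}
$$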

Using the fact that the quantity $(sM)^k = b \pmod{2}$ for every $k \in \mathbb{N}$, we have, for b = 0, $(sM) \cdot j \cdot (q_3 + q_5 \cdot (sM)^2) = 0 \pmod{2}$, and for b = 1, $(sM) \cdot j \cdot (q_3 + q_5 \cdot (sM)^2) \pmod{2} = j \cdot (q_3 + q_5) = 0 \pmod{2}$ from Theorem 3.4. Thus, for PP interleavers of degree at most five, the MD-LRS property is fulfilled.
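The conclusion of this appendix can be checked numerically for the degree-5 PP from the caption of Table 13. The sketch assumes N_ext = N/M = 16, takes Q_s = ⌊π(sM + j)/M⌋, and reads the MD-LRS property as the congruence Q_{s+2^i} − (s + 2^i) = Q_s − s (mod 2^{i+1}). These readings are inferred from the derivation above rather than quoted from the main text, and the function names are illustrative.

```python
N, M = 112, 7                   # length and assumed window size (16 windows)
NUM_BANKS = N // M              # assumed to play the role of N_ext
COEFFS = [65, 38, 16, 10, 12]   # q1..q5 of the 5-PP in the Table 13 caption


def pi(x):
    """Evaluate pi(x) = sum_{k=1..5} q_k * x^k (mod N)."""
    return sum(q * pow(x, k, N) for k, q in enumerate(COEFFS, start=1)) % N


def shift(s, j):
    """(Q_s - s) mod NUM_BANKS, with Q_s the bank index of pi(s*M + j)."""
    q_s = pi((s * M + j) % N) // M
    return (q_s - s) % NUM_BANKS


def shift_is_stable(i):
    """Check shift(s + 2^i, j) == shift(s, j) (mod 2^(i+1)) for all s and j."""
    step, mod = 2 ** i, 2 ** (i + 1)
    return all(
        (shift((s + step) % NUM_BANKS, j) - shift(s, j)) % mod == 0
        for s in range(NUM_BANKS)
        for j in range(M)
    )


# i ranges over the log2(16) = 4 bit positions of the bank index.
results = {i: shift_is_stable(i) for i in range(4)}
print(results)
```

A single polynomial is of course only a spot check of the general statement, not a substitute for the derivation.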

Appendix B. MD-LRS property for ARP interleavers with R component LPPs

For π(x) as in Eq. (4.1), for x = s · M + j, we have

$$
\pi(s \cdot M + j) = P \cdot (s \cdot M + j) + P_{l_{s,j}} \pmod{N} = Q_s \cdot M + r_{s,j} \pmod{N},
\tag{70}
$$

where

$$
r_{s,j} = P \cdot j + P_{l_{s,j}} - \theta_s \cdot M,
\tag{71}
$$

with

$$
l_{s,j} = sM + j \pmod{R},
\tag{72}
$$

and

$$
Q_s = P \cdot s + \theta_s \pmod{N_{ext}}.
\tag{73}
$$

Then

$$
\Delta(s) = Q_s - s = (P - 1) \cdot s + \theta_s \pmod{N_{ext}}.
\tag{74}
$$

Similarly,

$$
\pi((s + 2^i) \cdot M + j) = P \cdot \left((s + 2^i) \cdot M + j\right) + P_{l_{s_i,j}} \pmod{N} = Q_{s+2^i} \cdot M + r_{s_i,j} \pmod{N},
\tag{75}
$$

where

$$
r_{s_i,j} = P \cdot j + P_{l_{s_i,j}} - \theta_{s_i} \cdot M,
\tag{76}
$$


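with

$$
l_{s_i,j} = sM + j + 2^i \cdot M \pmod{R} = l_{s,j} + 2^i \cdot M \pmod{R},
\tag{77}
$$

and

$$
Q_{s+2^i} = P \cdot (s + 2^i) + \theta_{s_i} \pmod{N_{ext}}.
\tag{78}
$$

Then

$$
\Delta(s + 2^i) = Q_{s+2^i} - (s + 2^i) = (P - 1) \cdot s + (P - 1) \cdot 2^i + \theta_{s_i} = \Delta(s) + (P - 1) \cdot 2^i + \theta_{s_i} - \theta_s.
\tag{79}
$$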
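Eq. (79) can be explored numerically. The sketch below uses a hypothetical ARP, π(x) = P·x + S[x mod R] (mod N), with N = 64, R = 4, P = 5 and free terms S = [0, 12, 4, 8]; these parameters are chosen only for illustration and are not taken from the paper or from any standard. It assumes N_ext = N/M and Q_s = ⌊π(sM + j)/M⌋, and checks that whenever R divides 2^i·M, the shift Q_s − s is unchanged modulo 2^{i+1} when s is replaced by s + 2^i.

```python
N, M, R, P = 64, 2, 4, 5     # hypothetical ARP length, window size, R and P
S = [0, 12, 4, 8]            # hypothetical free terms (multiples of R)
NUM_BANKS = N // M           # 32, assumed to play the role of N_ext


def pi(x):
    """Hypothetical ARP: pi(x) = P*x + S[x mod R] (mod N)."""
    return (P * x + S[x % R]) % N


def shift(s, j):
    """(Q_s - s) mod NUM_BANKS, with Q_s the bank index of pi(s*M + j)."""
    q_s = pi((s * M + j) % N) // M
    return (q_s - s) % NUM_BANKS


def lpp_index_unchanged(i):
    """True if R | 2^i * M, so s and s + 2^i select the same component LPP."""
    return (2 ** i * M) % R == 0


def shift_is_stable(i):
    """Check shift(s + 2^i, j) == shift(s, j) (mod 2^(i+1)) for all s and j."""
    step, mod = 2 ** i, 2 ** (i + 1)
    return all(
        (shift((s + step) % NUM_BANKS, j) - shift(s, j)) % mod == 0
        for s in range(NUM_BANKS)
        for j in range(M)
    )


# i ranges over the log2(32) = 5 bit positions of the bank index.
for i in range(5):
    print(i, lpp_index_unchanged(i), shift_is_stable(i))
```

With M = 2 and R = 4 the divisibility condition holds for every i ≥ 1, so only i = 0 falls outside the case covered by the sufficient condition.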

Because for a valid ARP gcd(P, N) = 1, it results that $(P - 1) \cdot 2^i = 0 \pmod{2^{i+1}}$ when 2|N. However, $\theta_{s_i} - \theta_s \pmod{2^{i+1}}$ is not necessarily equal to 0 for a valid ARP. A sufficient condition for $\theta_{s_i} - \theta_s = 0 \pmod{2^{i+1}}$ is that $P_{l_{s_i,j}} = P_{l_{s,j}}$. This condition is always fulfilled when $l_{s_i,j} = l_{s,j}$, which, from Eq. (77), is equivalent to $R \mid (2^i \cdot M)$. When $R = 2^{n_R}$, $n_R \in \mathbb{N}^*$, this condition is always true if $i \geq \log_2(R)$. Thus, for ARP interleavers some configuration overhead can appear when using barrel shifter networks.

References

[1] A. Tarable, S. Benedetto, G. Montorsi, Mapping interleaving laws to parallel turbo and LDPC decoder architectures, IEEE Trans. Inf. Theory 50 (9) (2004) 2002–2009.
[2] E. Nieminen, A contention-free parallel access by butterfly networks for turbo interleavers, IEEE Trans. Inf. Theory 60 (1) (2014) 237–251.
[3] E. Nieminen, On quadratic permutation polynomials, turbo codes, and butterfly networks, IEEE Trans. Inf. Theory 63 (9) (2017) 5793–5801.
[4] J. Sun, O.Y. Takeshita, Interleavers for turbo codes using permutation polynomials over integer rings, IEEE Trans. Inf. Theory 51 (1) (2005) 101–119.
[5] S.M. Karim, I. Chakrabarti, High-throughput turbo decoder using pipelined parallel architecture and collision-free interleaver, IET Commun. 6 (11) (2012) 1416–1424.
[6] C. Studer, C. Benkeser, S. Belfanti, Q. Huang, Design and implementation of a parallel turbo-decoder ASIC for 3GPP-LTE, IEEE J. Solid State Circuits 46 (1) (2011) 8–17.
[7] M. Broich, T.G. Noll, Efficient VLSI architecture of QPP interleavers for LTE turbo decoders, in: Proceedings of the IEEE International Symposium on System-on-Chip, 2012.
[8] C.C. Wong, H.C. Chang, Reconfigurable turbo decoder with parallel architecture for 3GPP LTE system, IEEE Trans. Circuits Syst. II Exp. Briefs 57 (7) (2010) 566–570.
[9] O.Y. Takeshita, On maximum contention-free interleavers and permutation polynomials over integer rings, IEEE Trans. Inf. Theory 52 (3) (2006) 1249–1253.
[10] J. Wang, K. Zhang, H. Kröll, J. Wei, Design of QPP interleavers for the parallel turbo decoding architecture, IEEE Trans. Circuits Syst. I Reg. Papers 63 (2) (2016) 288–299.
[11] Y.L. Chen, J. Ryu, O.Y. Takeshita, A simple coefficient test for cubic permutation polynomials over integer rings, IEEE Commun. Lett. 10 (7) (2006) 549–551.
[12] H. Zhao, P. Fan, A note on "A simple coefficient test for cubic permutation polynomials over integer rings", IEEE Commun. Lett. 11 (12) (2007) 991.
[13] O.Y. Takeshita, Permutation polynomial interleavers: an algebraic-geometric perspective, IEEE Trans. Inf. Theory 53 (6) (2007) 2116–2132.
[14] G. Weng, C. Dong, A note on permutation polynomials over Z_n, IEEE Trans. Inf. Theory 54 (9) (2008) 4388–4390.
[15] J. Ryu, Permutation polynomials of higher degrees for turbo code interleavers, IEICE Trans. Commun. E95-B (12) (2012) 3760–3762.
[16] L. Trifina, D. Tarniceriu, Analysis of cubic permutation polynomials for turbo codes, Wirel. Person. Commun. 69 (1) (2013) 1–22.


[17] L. Trifina, D. Tarniceriu, A coefficient test for fourth degree permutation polynomials, AEU Int. J. Electron. Commun. 70 (11) (2016) 1565–1568.
[18] L. Trifina, D. Tarniceriu, A coefficient test for quintic permutation polynomials over integer rings, IEEE Access (2018) 37893–37909.
[19] L. Trifina, D. Tarniceriu, On the equivalence between cubic permutation polynomial interleavers and ARP interleavers for turbo codes, IEEE Trans. Commun. 65 (2) (2017) 473–485.
[20] L. Trifina, J. Ryu, D. Tarniceriu, Up to five degree permutation polynomial interleavers for short length LTE turbo codes with optimum minimum distance, in: Proceedings of the 15th IEEE International Symposium on Signals, Circuits and Systems (ISSCS 2017), Iasi, Romania, 2017.
[21] J. Ryu, L. Trifina, H. Balta, The limitation of permutation polynomial interleavers for turbo codes and a scheme for dithering permutation polynomials, AEU Int. J. Electron. Commun. 69 (10) (2015) 1550–1556.
[22] C. Berrou, Y. Saoter, C. Douillard, S. Kerouedan, M. Jezequel, Designing good permutations for turbo codes: towards a single model, in: Proceedings of the IEEE International Conference on Communications (ICC'04), volume 1, Paris, France, 2004, pp. 341–345.
[23] ETSI EN 301 790 V1.3.1, Digital Video Broadcasting (DVB); Interaction channel for satellite distribution systems, http://www.broadcasting.ru/pdf-standard-specifications/interactivity/dvb-rcs/en301790.v1.3.1.pdf. Accessed 12 September 2014.
[24] IEEE Standard for Air Interface for Broadband Wireless Access Systems, IEEE Standard 802.16, 2012.
[25] R1-070061, Motorola, ARP Interleaver Design for LTE, 3GPP RAN1#47bis, Sorrento, Italy, Jan. 14-19, 2007. http://www.3gpp.org/ftp/tsg_ran/WG1_RL1/TSGR1_47bis/Docs/R1-070061.zip. Accessed 24 October 2017.
[26] C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27 (3) (1948) 379–423.
[27] C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27 (4) (1948) 623–656.
[28] D.J.C. MacKay, R.M. Neal, Near Shannon limit performance of low density parity check codes, Electron. Lett. 32 (18) (1996) 457–458.
[29] E. Arikan, Channel polarization: a method for constructing capacity-achieving codes for symmetric binary-input memoryless channels, IEEE Trans. Inf. Theory 55 (7) (2009) 3051–3073.
[30] R.G. Maunder, The 5G channel code contenders, CTO AccelerComm, 2016, https://eprints.soton.ac.uk/401809/1/WhitePaper2.pdf. Accessed 2 July 2018.
[31] S. Crozier, P. Guinand, High-performance low-memory interleaver banks for turbo-codes, in: Proceedings of the IEEE 54th Vehicular Technology Conference (VTC 2001-Fall), Atlantic City, NJ, USA, 2001, pp. 2394–2398.
[32] R.G. Bohorquez, C.A. Nour, C. Douillard, On the equivalence of interleavers for turbo codes, IEEE Wirel. Commun. Lett. 4 (1) (2015) 58–61.
[33] D.H. Lawrie, Access and alignment of data in an array processor, IEEE Trans. Comput. C-24 (12) (1975) 1145–1155.
[34] W. Nöbauer, Über Permutationspolynome und Permutationsfunktionen für Primzahlpotenzen, Monatsh. Math. 69 (3) (1965) 230–238.
[35] G.L. Mullen, H. Stevens, Polynomial functions (mod m), Acta Math. Hungar. 44 (3-4) (1984) 237–241.
[36] G.H. Hardy, E.M. Wright, An Introduction to the Theory of Numbers, Fourth ed., Oxford University Press, 1975.
[37] R.L. Rivest, Permutation polynomials modulo 2^w, Finite Fields Appl. 7 (2) (2001) 287–292.
[38] J. Ryu, Efficient address generation for permutation polynomial based interleavers over integer rings, IEICE Trans. Fundam. E95-A (1) (2012) 421–424.
[39] E. Rosnes, On the minimum distance of turbo codes with quadratic permutation polynomial interleavers, IEEE Trans. Inf. Theory 58 (7) (2012) 4781–4795.
[40] L. Trifina, D. Tarniceriu, Improved method for searching interleavers from a certain set using Garello's method with applications for the LTE standard, Ann. Telecommun. 69 (5-6) (2014) 251–272.
[41] 3GPP TS 36.212 V8.3.0, 3rd Generation Partnership Project, Multiplexing and channel coding (Release 8), http://www.etsi.org/deliver/etsi_ts/136200_136299/136212/08.03.00_60/ts_136212v080300p.pdf. Accessed 05 June 2015.
[42] H. Balta, M. Kovaci, M. Nafornita, M. Balta, Multi-binary turbo-code design based on convergence of iterative turbo-decoding process, in: Proceedings of the 5th Conference on Circuits and Systems for Communications (ECCSC'10), Belgrade, Serbia, 2010, pp. 240–243.
[43] H. Balta, M. Kovaci, A. Isar, M. Nafornita, M. Balta, ARP and QPP interleavers selection based on the convergence of iterative decoding process for the construction of 16-state duo binary turbo codes, in: Proceedings of the 34th International Conference on Telecommunications and Signal Processing (TSP 2011), Budapest, Hungary, 2011, pp. 116–120.
