Improved DCT–DST prime factor algorithms

Improved DCT–DST prime factor algorithms

Signal Processing 81 (2001) 335}343 Improved DCT}DST prime factor algorithms Gideon Kutz, Hanoch Ur* Faculty of Engineering, Department of Electrical...

196KB Sizes 0 Downloads 88 Views

Signal Processing 81 (2001) 335}343

Improved DCT}DST prime factor algorithms Gideon Kutz, Hanoch Ur* Faculty of Engineering, Department of Electrical Engineering-Systems, Tel-Aviv University, P.O. Box 39040, Ramat Aviv, Tel Aviv 69978, Israel Received 6 January 2000; received in revised form 3 August 2000

Abstract E$cient algorithms, based on prime factor decomposition, for the calculation of the discrete cosine transform (DCT) and the discrete sine transform (DST) are presented. The proposed algorithms are an extension to the previously published algorithms and include corrections to inaccuracies found therein. Two hardware architectures, based on the proposed algorithms, are presented and shown to be superior, in terms of complexity, throughput and latency, to the existing hardware implementations.  2001 Elsevier Science B.V. All rights reserved. Keywords: Discrete cosine transform; Discrete sine transform; Fast transform; Prime-factor decomposition

1. Introduction The discrete cosine transform (DCT) and discrete sine transform (DST) are widely used in a range of applications including image and speech coding, image compression and more. To perform these transforms in real time, high speed very large scale integration (VLSI) circuit architectures, based on computationally e$cient algorithms are often needed. Fast algorithms for DCT and DST include: 1. Butter#y-structure-based methods [15]. These algorithms are computationally e$cient but usually not suitable for VLSI architectures because they require global communication between the multiplier units, which limits the

* Corresponding author. Tel.: #972-3-6407785 (o), 972 3 6419712 (h); fax: #972-3-6407095. E-mail addresses: [email protected] (G. Kutz), [email protected] (H. Ur).

computational rate. In addition, they are restricted to power of 2 length series. 2. Systolic arrays [3,9,10,14]. These algorithms are designed for hardware implementation, but, depending on the speci"c implementation, require many processing elements or have high latency (measured as the time required to complete the transform calculation) and low throughput (measured as the time delay between the "rst output of a data sequence and the "rst output of a consecutive data sequence). 3. Prime factor decomposition algorithms [4,11,12]. Prime factor decomposition-based algorithms possess important advantages including the ability to compute transforms of a length which is not a power of 2, low computational complexity, low number of required multipliers and, as shown here, low latency and high throughput. Cho and Lee proposed a prime-factor-based algorithm for computing DCT and DST [4]. It is based on computing

0165-1684/01/$ - see front matter  2001 Elsevier Science B.V. All rights reserved. PII: S 0 1 6 5 - 1 6 8 4 ( 0 0 ) 0 0 2 1 1 - 5

336

G. Kutz, H. Ur / Signal Processing 81 (2001) 335}343

a 2-D discrete Fourier transform (DFT) followed by a post-processing stage, and is suitable for any series length, N, that can be decomposed into N"N ;N ,   where N and N are co-prime odd numbers. Tat  saki et al. proposed another version of the algorithm, which, using a better choice of index mapping, eliminates the need for post-processing stage for an odd N, and extends the algorithm for an even N, with a reduced number of multiplications in the postprocessing stage [11,12]. Both [11,12] su!er from inaccuracies, and the algorithms presented would yield erroneous results. This is explained in detail later and was veri"ed using numerical simulations. In this paper, two e$cient VLSI architectures, based on prime factor algorithms are presented, one which has minimum number of multipliers while the other possesses very high throughput. The inaccuracies in [11,12] are analyzed and a corrected algorithm is proposed for N odd. It is shown that for N even, the prime factor algorithm is inef"cient. We present, however, an e$cient algorithm for N even when N has only one factor of two. Several extensions are discussed including decomposing N into more than two factors. In addition, the algorithms presented here are considered in terms of latency and throughput and compared to other fast algorithms. It is shown that the algorithm proposed yields a better trade-o! between computational complexity and high throughput than other algorithms. There are other possible architectures based on the proposed algorithms all yielding good complexity-throughput trade o!. The paper is organized as follows: In Section 2, the de"nitions used for DCT and DST are presented. The prime factor algorithm, and the corrections for [11] are presented in Section 3 for N odd. The analysis for N even, including the corrections to [11,12] and an algorithm proposed for N having only one factor of 2 is presented in Section 4. Finally, the two VLSI architectures proposed are described in Section 5 and compared to other fast architectures.

2. DCT and DST de5nitions There are several similar de"nitions for DCT and DST. We will use the de"nition proposed by

Ahmed et al. [1] for the DCT, y(k) for the data sequence x(n), 0)n)N!1, as





,\ p(2n#1)k y (k)"a(k) x(n)cos , "!2 2N L 0)k)N!1,

(1)

where





1 2 , a(k)" , 1)k)N!1. N N

a(0)"

For the DST we follow Wang's de"nition [13] since it is the closest to the above DCT de"nition





p(2n#1)k ,\ , y (k)"a(k) x(n)sin "12 2N L 1)k)N, 0)n)N!1.

(2)

The algorithms which are presented next are suitable also for other de"nitions including Jain's de"nition [6] of the DST, and for the inverse transforms, with slight modi"cations. It should be noted that the last term in the DST, k"N, can be calculated with no multiplications by



2 ,\ x(n)(!1)L, 0)n)N!1. (3) N L In what follows, the last term of the DST is considered to be computed separately by (3). The normalization term a(n) is neglected.

y(N)"

3. Algorithm for 1-D DCT/DST for odd length In this section, the prime factor algorithm for odd length that is suggested in [11] is described and corrected. The performance of the corrected algorithm is analyzed and compared to other fast algorithms for DCT/DST transforms. If we neglect the scale factor then 1!D N point DCT is obtained from





,\ y (k)"Re x(n);L>I , ! , L k"0, 1, 2, N!1,

(4)

G. Kutz, H. Ur / Signal Processing 81 (2001) 335}343

and the DST





,\ y (k)"Im x(n);L>I , k"1, 2 ,2, N, , 1 L

again according to [11] (and to be corrected later) by

(5)



y (k , k )"Im 1  

, \ , \  ?, L > x(n , n )=AI   , L  L 



 @, L > ;=BI ,

where

 

jpn ;L "exp . , N

(6)

Using prime factor decomposition, the 1-D transform is expressed as 2-D N ;N DFT.N and    N are co-prime, odd numbers such that  N ;N "N.   The time and frequency indices are mapped into 2-D arrays with n"1An #Bn 2 ,   ,

k"1Ck #Dk 2 ,   ,

(7)

for n , k 3[0, N !1], n , k 3[0, N !1], where       12 denotes modulo N. , The constants A, B, C, D should be chosen such that this mapping would be one to one. The conditions for this kind of mapping to be one to one is given in [2] and can be summarized with A"aN or B"bN ,   (A, N )"(B, N )"1,   C"cN or D"dN ,   (C, N )"(D, N )"1,  

(8)

B"bN , C"cN , D"dN .   

(9)

Since both N and N are odd numbers, it is clear   from (8) that all constants a, b, c, d can be chosen to be even numbers. Substituting (7) into (6), one obtains according to [11] (and to be corrected later)  ?, L >;BI @, L >. ;L>I";AI , , ,

for k even, and



y (k , k )"Re 1  

, \ , \  ?, L > x(n , n )=AI   , L  L



 @, L > ;=BI ,

(11)

for k odd. The kernel of the transform is de"ned as =LI"exp(2jpnk/N). (12) , This is a DFT kernel. Thus, the 1!D DST is computed by 2-D DFT. The calculation of DCT is similar with Re instead of Im and vice versa. There is a mistake, however, in this algorithm: the function ;L>I has a period of 2N (and not , N as in DFT) so when substituting (7) into (6), the modulo N should not be ignored in the mapping as was done in [11]. Consequently, Eqs. (10) and (11) must be modi"ed. To correct the algorithm, we will write the relationship between the mappings with and without the modulo function. aN n #bN n "zN#1aN n #bN n 2         , "zN#n (13)

where (a, b) represent the greatest common divisor of a and b. We will choose the constants so that A"aN , 

337

(10)

Assuming this to be correct and if c and d are chosen to be a multiple of 4, then the DST is given,

and cN k #dN k "vN#1cN k #dN k 2         , "vN#k, (14) where v and z are integers. Using (13) and (14) we can express ;L>I as ,       ?, L > @, L > ;L>I";AI ;BI , , , !jpv(2n#1) ;exp . (15) 2





This equation is similar to (10) except for a correction factor of exp[!jpv(2n#1)/2], where v is de"ned in (14).

338

G. Kutz, H. Ur / Signal Processing 81 (2001) 335}343

Because N is odd and c and d are even, it can be concluded from (14) that v is even i! k is even. In this case, the correction factor will be equal to 1 or (!1) for all values of n. Thus, the result is that the algorithm is correct for k even, except that for some of its outputs we must add a negation operation. This does not add signi"cantly to the computational complexity of the algorithm and it can be neglected in the complexity evaluation. When k is odd, it is concluded from (14) that v is odd. In this case, the correction factor will be equal to (!j sin(pv/2)) for n even and ( j sin(pv/2)) for n odd. The dependency of the correction factor on v can be neglected because, as for k even, it is constant for all n, and we can deal with it by adding a negation operation for some of the outputs. The dependency of the factor on n, however, cannot be neglected. Since the DST/DCT requires a summing over all values of n, the result of the algorithms in [11,12] will lead to an error. The solution to this problem is to compute separately the transform for even k and odd k. The computation of k odd will be done by reversing the sign of odd indexed input terms, while for k even, the original data sequence without modi"cation will be used as input. The real part of the output sequence obtained from the block for k even and the imaginary part of the output sequence obtained for the block for k odd are the DCT transform. The DST will be obtained from the imaginary part of the output sequence of the block for k even and real part of the output sequence for the block of k odd. If we denote t (k) as the output of the "rst block for  k even and t (k) as the output of the second block  for k odd, then

 

Re+t (k),, k even,  y (k)" ! Im+t (k),, k odd, 

(16)

Im+t (k),, k even,  y (k)" 1 Re+t (k),, k odd. 

(17)

This algorithm can be easily generalized to cases where N is factored into 3 or more relatively prime, odd numbers. This procedure is done separately for k even and k odd and thus, requires twice the number of arithmetic operations than the original algorithm. As

can be seen, however, from the following sections, the corrected algorithm is still attractive in various VLSI architectures.

4. Computing DCT/DST of even length In this section, the inaccuracies in [11,12] for an even N are corrected. The prime factor algorithm is shown to be ine$cient in this case. An e$cient method of computing even length DCT/DST transforms, for length of N, where N/2 is odd, however, is described. First, the DCT is rewritten as





,\ (18) y (k)"Re exp(jpk/2N) x(n);LI . , ! L Here N is decomposed to an odd term N and an  even term N . The index mapping is the same as (7)  and the conditions for the uniqueness are given in (8). Choosing the constants as in (9), the following constraints should apply among the constants 1bc2 "1ad2 "0, 1acN 2  "j , , ,  ,  (19) 1bdN 2  "j .   , Substituting (7) into the kernel of the transform in (18) we get, according to [11] and [12] ;LI";HL I ;HL I . (20) , , , This kernel is used in [11,12] to transform the DCT/DST calculation to a 2-D DFT calculation. But, as in the case for an odd N, there exists a correction factor. An accurate calculation of the kernel yields ;I ;LI";H L>AI ;H L >BI , , , , (2n#1)v#2zk ;exp !jp . 2





(21)

where z and v are de"ned in (13) and (14). The factor ;I , which is considered only in the , post-processing stage in [11,12] has been included here in the calculation to obtain a single correction factor for the algorithm. In this case, the correction factor, exp(!jp[(2n #1)v#2zk]/2), is more complex than in the case for an odd N. For example, it can be veri"ed that if

G. Kutz, H. Ur / Signal Processing 81 (2001) 335}343

we want to use the prime factor algorithm for N"6, we have to split the calculation to four di!erent cases, each requiring a di!erent pattern of correction (Case 1 } k"0,4; Case 2 } k"2; Case 3 } k"3; Case 4 } k"1,5). This, of course, makes the prime factor algorithm ine$cient for an even N. A method for computing transform for N even, when N/2 is odd, is presented in the following. We decompose "rst the DCT transform (the same procedure applies for DST) to a sum of odd and even terms ,\ y(k)" x(n)e pL>I, L ,\ x(2n)e pL>I, e\ pI, "Re L ,\ # x(2n#1)e pL>I, e pI, . L (22)

 







As can be seen from this equation, the transform for k3+0, 2, N/2!1,can be computed by computing 2 DCT's for two N/2 length sequences, consisting of the even and odd indexed terms of the N length series and a post-processing stage. In the postprocessing stage, the "rst output is multiplied by exp(!jpk/2N), the second output by its complex conjugate, and then the two results are added. If the DCT computation for the "rst series (for even n) yields the results t (k) and t (k) for k even and odd,   respectively, as was explained in Section 3, and



y(k)"

339

DCT computation for the second series yields the results l (k) and l (k), then   Re+t (k)e\ pI,#l (k)e pI,,,   k even and k3+0 ,2, N/2!1,, y(k)" (23) Im+t (k)e\ pI,#l (k)e pI,,,   k odd and k3+0 ,2, N/2!1,.



The same procedure for k3+N/2, 2, N!1, yields





,\ x(2n)e pL>I\,, L ;e pL>e\ pI\,,

y(k)"Re



#



,\ x(2n#1)e pL>I\,, L



;e pL>e pI\,, .

The terms in the square brackets are DCT transforms of length N/2, but now there is a correction factor j, n even, (25) e pL>" !j, n odd.



So the DCT should be calculated now by reversing the signs of every odd indexed term in the input sequence. But as explained in the previous section, we have already calculated it as a part of the prime factor algorithm for N/2 (odd length series), so the result for DCT for length N is achieved without additional calculations except the post-processing stage. The "nal result is

Re+t (k)e\ pI,#l (k)e pI,,,  

k even and k3+0, 2, N/2!1,,

Im+t (k)e\ pI,#l (k)e pI,,,  

k odd and k3+0, 2, N/2!1,,

 

 

 

 

(24)

 

N N Im t k! e\ pI\,,#l k! e\ pI\,, , k odd and k3+N/2, 2 , N!1,,   2 2

N N Re t k! e\ pI\,,#l k! e\ pI\,, ,   2 2

k even and k3+N/2, 2 , N!1,.

(26)

340

G. Kutz, H. Ur / Signal Processing 81 (2001) 335}343

5. VLSI architectures based on the prime factor algorithm In this section, two hardware implementations for odd length series, based on the algorithm presented in Section 3, are described and compared to other implementations. Other architectures, which are a mixture of the two described here, are possible and can be designed to achieve speci"c complexity (number of required multipliers) and throughput. The hardware implementations can be based on existing VLSI architectures for the calculation of

short DFT's. The required architecture for N even (where N/2 is odd), based on Section 4, is identical to N odd with the addition of a post-processing stage. 5.1. VLSI architecture with a minimum number of multipliers This architecture is similar to the one proposed in [4] except that it eliminates the need for the post-processing stage after the second DFT. This is achieved with a better choice of index mapping

Fig. 1. VLSI architecture with minimum number of multipliers.

G. Kutz, H. Ur / Signal Processing 81 (2001) 335}343

(as done in [11]). The architecture, for the N"15"5*3 transform, is described in Fig. 1. The "rst block is for computing an even k and the second for an odd k. The real/imaginary part is taken from t(k) to obtain DCT/DST according to (16) and (17). The number of multiplication's for several short DFT's, f (N), which is given in Table 1 [7], is for real valued input sequences. The number of multiplications for complex input series is 2f (N). If we are interested, however, in the real or imaginary part of the result, as in the case of DCT/DST, then we need only f (N) multiplications (this can be seen from examining the calculation process in [7]: every multiplication and addition in the calculation process is intended to compute the real or imaginary part of the result but not both). The "rst DFT block has a real-valued input, while the second DFT block requires only computation of real/imaginary part of the result, so the number of multipliers in them is f (N ) and f (N ), respectively. As already   mentioned, two N ;N architectures are re  quired, one for computing an even k and another for an odd k. The total number of multipliers needed thus is 2[ f (N )#f (N )].   In the case of 3 co-prime factors, there are 3 DFT blocks. The middle block has complex inputs

Table 1 Number of multiplications for short DFT's DFT length } N

3

5

7

9

Multiplications } f (N)

1

5

8

8

341

and outputs, so the number of multipliers in that case is 2[ f (N )#2f (N )#f (N )].    Table 2 compares the number of multipliers required for di!erent DCT algorithms. As can be seen from the table, this version of the algorithm requires less multiplier cells than other algorithms which are known to be e$cient. In addition, both [5,8] require butter#y stages which make them less attractive for hardware implementations, while the prime factor algorithm is locally connected (does not require connection between remote processing elements), and does not require butter#y stages. 5.2. VLSI architecture for high throughput Another option for using this algorithm is to compute N blocks of N point DFT in parallel   and then N blocks of N point DFT in parallel.   The architecture is depicted in Fig. 2 (only the block for an even k is shown because the second block for k odd is identical with reversing the sign of odd indexed inputs). The architecture is the same as the one described in [11] but has two di!erent parts, one for an even k and another for an odd k. This results in a high throughput architecture at the cost of additional multipliers and adders. The number of multiplications (and multipliers) in this algorithm is 2[N f (N )#N f (N )].     For the case of 3 factors, the number of multiplications is 2[N N f (N )#2N N f (N )#N N f (N )].         

Table 2 Number of multipliers needed for several algorithms Length

7

Proposed algorithm [4] [8] [5]

16

8

20 12 7

9

15

16

10

22

24

16

21

32

18 32 32 15

80 31

35

45

63

24

24

32

52

52

76

64

192 63

342

G. Kutz, H. Ur / Signal Processing 81 (2001) 335}343

In Table 3, the number of multiplications is compared to a fast, butter#y structured based, algorithm for computing the DCT [15]. The number of multiplications for the DCT is comparable to the butter#y based algorithm, which is known to be e$cient in this sense. For the DST, however, there exists another variant of butter#y based algorithm which is more e$cient than the prime factor algorithm [15]. As already mentioned,

however, these algorithms are not well suited for hardware implementations. Additional important considerations in evaluating the architecture performance are it's latency and throughput, as de"ned in the introduction. Table 4 summarizes the number of multipliers, latency and throughput of the two versions of the prime factor algorithm presented here and compares with other fast algorithms [10]. Only algorithms for hardware implementations are considered in the table. g(N) in Table 4 is the latency of the DFT block in the algorithm and is dependent on the speci"c implementation. By considering various architectures for several short DFT blocks, with various degrees of parallelism, we estimate g(N) to be 3}8 clock cycles. ¹ represents the time required for additions and multiplications in a single cycle. 5.3. VLSI architectures for an even N For an even N (with only one factor of 2), two odd length (N/2) series hardware blocks and a post-processing stage block, as explained in Section 4, are required. In the post-processing stage, the two factors that multiply the DCT results are complex conjugate,

Fig. 2. VLSI architecture for high throughput.

Table 3 Comparison of numbers of multiplications Length

7

Proposed algorithm [15]

16

8

9

15

16

34

16

16

21

32

62 44

35

63

144

256

116

64

105

128

256

548 292

315 1784

708

1668

Table 4 Comparison of algorithm performances

[4] [3] [9] [10] Version 1 Version 2

Real multipliers

Latency

Throughput

4(N#1) 2(N!1) 8N N/2 f (N )#f (N )   N f (N )#N f (N )    

2N¹ (2N!2)¹ [2((N#1)#(N#1]¹ (N!1)¹ N g(N )¹#N g(N )¹     g(N )¹#g(N )¹  

N¹ N¹ (N¹ ¹ max(N , N )¹   ¹

G. Kutz, H. Ur / Signal Processing 81 (2001) 335}343

so we can reduce number of multiplications by use of y (k)(a#ib)#y (k)(a!ib)   "a(y (k)#y (k))#jb(y (k)!y (k)).     For k"0 or k"N/2 there is no need for postprocessing multiplication. Thus, the number of real multiplications, in the post-processing stage, is 4(N!2), for computing DCT/DST simultaneously. If one wants to compute only one of the transforms then for every k only the real or imaginary part of the result needs to be computed, as can be seen from the last equation. Thus, only 2(N!2) multiplications are needed for one transform computation. The number of multiplications needed for computing the transform is, thus 4[N f (N )#2N f (N )]#2(N!2),     where N N ;N "   2

and

N is odd. 2

6. Conclusions E$cient hardware implementations for computing on line DCT and DST, based on prime factor algorithm, have been discussed in this paper. Algorithms for an odd N and for an even N (where N/2 is odd) have been proposed. The algorithms are suitable also for computing the inverse transforms (IDCT, IDST) with slight modi"cations. Two VLSI architectures based on these algorithms have been presented. The VLSI architectures have been analyzed and shown to be superior to the existing implementations. Other implementations based on these algorithms are also possible and can be designed to achieve speci"c complexity * throughput trade o!.

343

References [1] N. Ahmed, T. Natarajan, K.R. Rao, Discrete cosine transform, IEEE Trans. Comput. 23 (1974) 90}93. [2] C.S. Burrus, Index mapping for multidimensional formulation of the DFT and convolution, IEEE Trans. Acoust. Speech Signal Process. ASSP-25 (3) (June 1977) 239}242. [3] W. Chang, M.C. Wu, A uni"ed systolic array for discrete cosine and sine transform, IEEE Trans. Signal Process. SP-39 (1991) 192}194. [4] I. Cho, S.U. Lee, DCT algorithms for VLSI parallel implementations, IEEE Trans. Acoust. Speech Signal Process. 38 (1) (January 1990) 121}127. [5] H.S. Hou, A fast recursive algorithms for computing the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process. ASSP-35 (October 1987) 1455}1461. [6] A.K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cli!s, NJ, 1989, pp. 150}155. (Chapter 5). [7] D. Kolpa, T. Parks, A prime factor algorithm using high speed convolution, IEEE Trans. Acoust. Speech Signal Process. 25 (8) (August 1977) 281}294. [8] B.G. Lee, A new algorithm to compute the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (December 1984) 1243}1245. [9] H. Lee, On computing 2-D systolic algorithm for discrete cosine transform, IEEE Trans. Circuits and Systems CAS37 (1990) 1321}1323. [10] S.B. Pan, R.H. Park, Uni"ed systolic array for fast computation of the discrete cosine transform, discrete sine transform, and discrete hartley transform, Optical Engineering 36 (12) (December 1997) 3439}3444. [11] A. Tatsaki, C. Dre, T. Stouratis, C. Goutis, On the computation of the prime factor DST, Signal Process. 42 (1995) 231}236. [12] A. Tatsaki, C. Dre, T. Stouratis, C. Goutis, Prime factor DCT algorithms, IEEE Trans. Signal Process. 43 (3) (March 1995) 772}776. [13] Z. Wang, Fast algorithms for the discrete W transform and for the discrete Fourier transform, IEEE Trans. Acoust. Speech Signal Process. 32 (8) (August 1984) 803}816. [14] L. Wang, C.Y. Chen, High throughput VLSI architectures for the 1-D and 2-D discrete cosine transform, IEEE Trans. Circuits and Systems Video Technol. CSVT-5 (1995) 31}40. [15] P. Yip, K. Rao, A Fast computational algorithm for the discrete sine transform, IEEE Trans. Commun. com-28 (2) (February 1980) 304}307.