Signal Processing 81 (2001) 2333–2342
www.elsevier.com/locate/sigpro
DCT hardware structure for sequentially presented data

T.C. Tan, Guoan Bi*, Yonghong Zeng, H.N. Tan
School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, Singapore 639798, Singapore

Received 8 May 2000; received in revised form 12 December 2000

* Corresponding author. Tel.: +65-790-4823; fax: +65-7933318. E-mail address: [email protected] (G. Bi).
Abstract

This paper shows that a fast DCT algorithm can be mapped into a hardware structure that consists of log2 N modules. Such a hardware structure can be used for bit-serial word-serial or word-serial bit-parallel implementation. Compared to other methods of hardware implementation, the proposed one provides a natural interface with sequentially presented input data, achieves a high utilization of hardware, requires a low processing latency and has a modular architecture that can be extended to support different transform sizes. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Discrete cosine transform; Bit-serial processing
1. Introduction

Many real-time applications of the DCT rely upon hardware implementation to support high processing throughput. Some proposals employ parallel processing methods [6] while others use serial processing methods [2,3,7]. The main objectives of these efforts are to match the system throughput, maximize hardware utilization and minimize chip area. For sequentially presented data, parallel processing requires serial-to-parallel and parallel-to-serial converters, which increases both the hardware complexity and the processing latency. On the other hand, the serial processing approach supports a lower processing throughput with significantly less chip area. In particular, it is easier to use serial processing techniques to achieve a cost-effective solution because optimization of wiring complexity and higher operating speeds can be achieved. Serial processing techniques can be used to naturally interface with signals that are transmitted bit-sequentially or word-sequentially from a remote data source. Therefore, it is possible to remove the serial-to-parallel and parallel-to-serial conversions, which saves both conversion hardware and the associated processing latency. In terms of processing throughput per unit chip area, the serial processing method also has a good potential of high chip-area utilization. Several designs [2,3,7] that compute one length-N transform within N word cycles have been reported. When N is large, the methods reported in [3,7] are not desirable because the number of multipliers increases proportionally to the transform size. On the other hand, a systolic implementation based on Hou's [5] algorithm requires only log2 N multipliers [2], which
significantly minimizes the total chip area. According to the computational complexity of the radix-2 fast algorithm, a 1D DCT of length N needs 0.5N log2 N multiplications and 1.5N log2 N − N + 1 additions. Within N input cycles, an efficient processing system should therefore ideally use about 0.5 log2 N multipliers and 1.5 log2 N adders on average. However, most reported processing systems require much more hardware than necessary. Based on a modified fast algorithm, this paper proposes a 1D array structure that is particularly suited to sequentially (or serially) presented data. The array is composed of log2 N modules, each containing an input buffer, a butterfly module and a multiplier. In particular, the proposed hardware structure naturally interfaces with sequentially presented data without requiring serial-to-parallel and parallel-to-serial converters. In addition to the properties of a systolic array structure, the proposed design achieves a low processing latency and requires less memory to support high hardware utilization compared to other reported work.

2. The fast algorithm

If the normalization factors are ignored, the 1D type-II N-point DCT of a real sequence x(n) is defined by

X(k) = \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N}, \quad 0 \le k \le N-1,   (1)
where N = 2^r is the size of the transform. By relating the DCT to the DFT, Chan [1] reported a fast algorithm that needs to shuffle N input data. For the transform outputs of even indices, the fast algorithm can be derived based on the decimation-in-frequency (DIF) approach. The transform outputs of odd indices can be indirectly obtained from a recursive computation. Our approach needs only to reverse the input order of N/2 data. The proposed and Chan's algorithms appear to be similar to each other except for the difference in data indexing and hence in the twiddle factors used. The fast algorithm is described by the three equations below [4] and the signal flow graph is illustrated in Fig. 1 for N = 8.
X(2k) = \sum_{n=0}^{N/2-1} [x(n) + x(N-n-1)] \cos\frac{(2n+1)k\pi}{2(N/2)}, \quad 0 \le k \le N/2-1,   (2)

F(k) = \sum_{n=0}^{N/2-1} 2[x(n) - x(N-n-1)] \cos\frac{(2n+1)\pi}{2N} \cos\frac{(2n+1)k\pi}{2(N/2)}, \quad 0 \le k \le N/2-1,   (3)

X(1) = 0.5F(0), \qquad X(2k+1) = F(k) - X(2k-1), \quad 1 \le k \le N/2-1.   (4)
The even indexed outputs X(2k) are directly calculated from (2), whereas an intermediate step in (3) is necessary for the odd indexed outputs X(2k+1). Fig. 1 shows the data flow graph, which has a regular and simple structure that can be mapped into a hardware implementation by directly exploiting the inherent parallelism of the algorithm. Each node in Fig. 1 can be implemented with adders and/or multipliers, and these nodes are interconnected according to the flow graph. Such an implementation has a few major problems. The first one is that it requires many multipliers and adders, which occupy too much chip area. If the input sequence is sequentially transmitted from a remote source, serial-to-parallel and parallel-to-serial converters have to be used to interface the input/output sequences. A further consequence of the direct parallel mapping is that the hardware operators cannot be fully utilized since they have to wait for the entire input sequence. One alternative approach is to process each datum of the input sequence as it arrives. The partially processed result is accumulated to produce the final outputs sequentially after the last datum of the input sequence arrives. The fast algorithm given by (2)–(4) particularly supports sequential operations because X(2k+1) can only be computed after X(2k−1) is available. This recursive property does not suit parallel processing, which requires all the input data to be available at the same time. Mapping the fast algorithm to a suitable hardware structure can avoid these potential drawbacks.
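As a quick illustration of the recursion in (2)–(4), the following NumPy sketch (ours, not part of the paper; function names are assumptions) computes the unnormalized DCT-II by the even/odd split and checks it against the direct definition in (1).

```python
import numpy as np

def dct2_direct(x):
    """Unnormalized DCT-II of (1): X(k) = sum_n x(n) cos((2n+1)k*pi/(2N))."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos((2 * n + 1) * k * np.pi / (2 * N)))
                     for k in range(N)])

def dct2_split(x):
    """Even/odd decomposition of (2)-(4); N must be a power of two."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    if N == 1:
        return x.copy()
    n = np.arange(N // 2)
    s = x[n] + x[N - 1 - n]                                                  # inputs to (2)
    d = 2.0 * (x[n] - x[N - 1 - n]) * np.cos((2 * n + 1) * np.pi / (2 * N))  # inputs to (3)
    Xe = dct2_split(s)     # even-indexed outputs: a length-N/2 DCT of the sums
    F = dct2_split(d)      # F(k) of (3): a length-N/2 DCT of the weighted differences
    X = np.empty(N)
    X[0::2] = Xe
    X[1] = 0.5 * F[0]                                                        # first line of (4)
    for k in range(1, N // 2):
        X[2 * k + 1] = F[k] - X[2 * k - 1]                                   # recursion of (4)
    return X

if __name__ == "__main__":
    x = np.random.rand(8)
    assert np.allclose(dct2_split(x), dct2_direct(x))
```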
Fig. 1. DCT data flow graph for N = 8.
Timing is important when converting Fig. 1 into a hardware structure that deals with sequential input data. Fig. 1 shows that computation can only be started after the first N/2 data are collected, i.e., when x(N/2) arrives. Therefore, the preprocessing that combines the input data starts with x(N/2 − i − 1) + x(N/2 + i), i = 0, ..., N/2 − 1, and then continues with x(N − i − 1) − x(i), i = N/2, N/2 + 1, ..., N − 1, where the numbers inside the circles in Fig. 1 give the order in which the additions/subtractions are performed. If these operations are performed sequentially, one adder/subtractor is sufficient to perform the N operations within the N input cycles. At the second stage, similar operations are performed except that the index distance of the input data at stage 2 (the outputs from stage 1) becomes N/4 instead of the N/2 used at the first stage. In general, such operations are performed in the first log2 N stages, as shown in Fig. 1. The range of the input index at the ith stage is N/2^i.
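To make the first-stage timing concrete, here is a small sketch (ours; the function name and the returned tuple format are assumptions) that lists, cycle by cycle, the single addition or subtraction performed by the stage-1 adder/subtractor for one block of N samples, following the ordering described above and in Table 2.

```python
def stage1_schedule(N):
    """Return (cycle, operation) pairs for the first preprocessing stage.
    Sums are produced while the second half of the block arrives (cycles
    N/2 .. N-1); differences follow in the next N/2 cycles, overlapping the
    arrival of the next block's first half."""
    ops = []
    for j in range(N // 2):
        ops.append((N // 2 + j, f"x({N//2 - 1 - j}) + x({N//2 + j})"))
    for j in range(N // 2):
        ops.append((N + j, f"x({N//2 - 1 - j}) - x({N//2 + j})"))
    return ops

# For N = 8: cycle 4: x(3)+x(4), ..., cycle 7: x(0)+x(7), cycle 8: x(3)-x(4), ...
for cycle, op in stage1_schedule(8):
    print(cycle, op)
```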
The operations performed at the output stage are subtractions, according to (4), for the odd indexed outputs, together with data swapping, as indicated in Fig. 1.

3. System level design

The flow graph shown in Fig. 1 is folded into a hardware structure to obtain the odd- and even-indexed DCT outputs. At the ith stage in Fig. 1, the additions defined in (2) are performed for the even indexed outputs within the first N/2^i cycles, and the subtractions defined in (3) are performed for the odd indexed outputs in the next N/2^i cycles. The differences are then multiplied by the twiddle factors. An input buffer is needed to provide the required data at the right time and in the right place. The output stage produces the final transform outputs by performing recursive subtractions according to (4). Fig. 2 presents the system block diagram for N = 8 deduced from the data flow graph in Fig. 1.
Fig. 2. Proposed system block diagram for an 8-point 1D DCT.
A 1D DCT of length 2^r generally needs r processing stages together with a postprocessing stage. The preprocessing block for the even (or odd)-indexed outputs involves reordering the input data and generating a new sequence by adding (or subtracting) x(n) to (or from) x(N − n − 1) for n = 0 to N/2^i − 1. In the postprocessing, the even-indexed outputs only need to be reshuffled to obtain a sequence in natural order. The odd-indexed outputs, however, are obtained not only by reshuffling but also by additional computation according to (4). Four control signals are needed for these purposes and their timings are shown in Table 1. In our presentation, the control signals are preceded by an underscore "_" symbol, and switches are implemented using multiplexers and demultiplexers. In Fig. 2, the 2-point 1D DCT module is located in the middle of the array structure. We can extend the 2-point 1D DCT module by adding one preprocessing stage and one postprocessing stage to compute the 4-point 1D DCT. Therefore, the 1D DCT of an arbitrary N = 2^r points can be easily obtained by cascading a suitable number of preprocessing and postprocessing stages, as illustrated by the sketch below. Fig. 2 shows the structure for an 8-point DCT that contains two preprocessing and two postprocessing stages.
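A minimal functional sketch of this modular extension (ours; names such as `extend` and `dct_2pt` are assumptions, and the folding/time-multiplexing of the real hardware is not modelled): each call to `extend` wraps one preprocessing stage and one postprocessing stage around a half-size transform, so the 8-point DCT is built from the 2-point core by applying it twice.

```python
import numpy as np

def dct_2pt(x):
    """2-point DCT core: X(0) = x(0)+x(1), X(1) = 0.707*(x(0)-x(1))."""
    return np.array([x[0] + x[1], (x[0] - x[1]) * np.cos(np.pi / 4)])

def extend(x, half_dct):
    """Wrap one preprocessing and one postprocessing stage (Eqs. (2)-(4))
    around a half-size DCT, mirroring how Fig. 2 grows the module chain."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N // 2)
    s = x[n] + x[N - 1 - n]                                                  # preprocessing, even path
    d = 2.0 * (x[n] - x[N - 1 - n]) * np.cos((2 * n + 1) * np.pi / (2 * N))  # preprocessing, odd path
    Xe, F = half_dct(s), half_dct(d)
    X = np.empty(N)
    X[0::2] = Xe                         # even outputs from (2)
    X[1] = 0.5 * F[0]                    # postprocessing recursion of (4)
    for k in range(1, N // 2):
        X[2 * k + 1] = F[k] - X[2 * k - 1]
    return X

dct_4pt = lambda x: extend(x, dct_2pt)   # one pre- and one post-stage around the core
dct_8pt = lambda x: extend(x, dct_4pt)   # two of each, as in Fig. 2
```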
Table 1
Timing for interconnection signals (N = 8)

Clock | Pre_8 | Pre_4 | Post_4 | Post_8
0     | —     | —     | —      | —
...   | ...   | ...   | ...    | ...
4     | 1a    | —     | —      | —
5     | 1a    | —     | —      | —
6     | 1a    | 1b    | —      | —
7     | 1a    | 1b    | —      | —
8     | 2a    | 2b    | 1c     | —
9     | 2a    | 2b    | 1c     | 1d
10    | 2a    | 1b    | 2c     | 1d
11    | 2a    | 1b    | 2c     | 1d
12    | 1a    | 2b    | 1c     | 1d
13    | 1a    | 2b    | 1c     | 2d
14    | 1a    | 1b    | 2c     | 2d
15    | 1a    | 1b    | 2c     | 2d
16    | 2a    | 2b    | 1c     | 2d
4. Functional modules

This section presents details of the three basic modules: the preprocessing module, the 2-point 1D DCT module and the postprocessing module. Their operations, together with the timing requirements and control signals, are also described.
Fig. 3. The preprocessing block at the ith stage.
4.1. Preprocessing module
This module performs the functions of addition (or subtraction) and of converting the sequentially presented input stream into two data streams. Fig. 3 shows a functional module that performs the conversion. It consists of one bi-directional buffer, one uni-directional buffer, one 2:1 multiplexer and one 1:2 demultiplexer. Its main function is to redistribute the sequentially received data according to the timing required by Table 2. For simplicity of presentation, let us consider the operations needed by the first stage of Fig. 1. Both buffers are shift registers holding N/2 input data. The bi-directional buffer, however, can also reverse the order of the data sequence, which can be realized by shift registers and some logic gates that control the shifting direction.
Table 2
Timing of the preprocessing module at the first stage

Clock | Data in xi(n)N | Sel_8pts | A      | B      | C      | D      | Mux_8pts xij(n)N   | Mux_1 | Mux_2
0     | x1(0)8         | 1a       | —a     | —      | —      | —      | —                  | —     | —
1     | x1(1)8         | 1a       | —      | —      | —      | —      | —                  | —     | —
2     | x1(2)8         | 1a       | —      | —      | —      | —      | —                  | —     | —
3     | x1(3)8         | 1a       | —      | —      | —      | —      | —                  | —     | —
4     | x1(4)8         | 2a       | x1(4)8 | x1(3)8 | —      | —      | x1(3)8 + x1(4)8    | 1     | 1
5     | x1(5)8         | 2a       | x1(5)8 | x1(2)8 | —      | —      | x1(2)8 + x1(5)8    | 1     | 1
6     | x1(6)8         | 2a       | x1(6)8 | x1(1)8 | —      | —      | x1(1)8 + x1(6)8    | 1     | 1
7     | x1(7)8         | 2a       | x1(7)8 | x1(0)8 | —      | —      | x1(0)8 + x1(7)8    | 1     | 1
8     | x2(0)8         | 2a       | —      | x1(4)8 | x1(3)8 | —      | x1(3)8 − x1(4)8    | 0     | 1
9     | x2(1)8         | 2a       | —      | x1(5)8 | x1(2)8 | —      | x1(2)8 − x1(5)8    | 0     | 1
10    | x2(2)8         | 2a       | —      | x1(6)8 | x1(1)8 | —      | x1(1)8 − x1(6)8    | 0     | 1
11    | x2(3)8         | 2a       | —      | x1(7)8 | x1(0)8 | —      | x1(0)8 − x1(7)8    | 0     | 1
12    | x2(4)8         | 1a       | x2(4)8 | —      | —      | x2(3)8 | x2(3)8 + x2(4)8    | 1     | 0
13    | x2(5)8         | 1a       | x2(5)8 | —      | —      | x2(2)8 | x2(2)8 + x2(5)8    | 1     | 0
14    | x2(6)8         | 1a       | x2(6)8 | —      | —      | x2(1)8 | x2(1)8 + x2(6)8    | 1     | 0
15    | x2(7)8         | 1a       | x2(7)8 | —      | —      | x2(0)8 | x2(0)8 + x2(7)8    | 1     | 0
16    | x3(0)8         | 1a       | —      | —      | x2(3)8 | x2(4)8 | x2(3)8 − x2(4)8    | 0     | 0
17    | x3(1)8         | 1a       | —      | —      | x2(2)8 | x2(5)8 | x2(2)8 − x2(5)8    | 0     | 0
18    | x3(2)8         | 1a       | —      | —      | x2(1)8 | x2(6)8 | x2(1)8 − x2(6)8    | 0     | 0

a "—" represents Do not Care.
Fig. 4. Processing module.

Fig. 5. 2-point DCT.
The multiplexer and demultiplexer select data from the outputs of these buffers. As shown in Fig. 3, the first N/2 input data are sequentially shifted into the bi-directional buffer through input point 1a. Within the next N/2 word cycles, the first N/2 input data, x(N/2−1) to x(0), become sequentially available on line B in reversed order, and the next N/2 input data, x(N/2) to x(N−1), are sequentially available on line A. The data available on lines A and B are sequentially sent to the next module, shown in Fig. 4, for computation. At the same time, the data x(N/2−1) to x(0) and x(N/2) to x(N−1) are sequentially shifted into the uni-directional buffer and the bi-directional buffer, respectively, through input points 2a. Within the third N/2 word cycles, the input data x(N/2−1) to x(0) are sequentially shifted out of the uni-directional buffer and become available on line C, while the data x(N/2) to x(N−1) are shifted out of the bi-directional buffer and become available on line B. Within the same N/2 word cycles, the data x(0) to x(N/2−1) of the next transform are sequentially shifted into the bi-directional buffer through input point 2a. Within the next N/2 cycles, the data x(N/2−1) to x(0) and x(N/2) to x(N−1) are sequentially available on lines D and A, respectively. Therefore, the preprocessing module can continuously receive input data for consecutive transform sequences. The details of the operations performed by the preprocessing module for N = 8 are shown in Table 2, in which xi(n)N is the input data sequence, i is the index of the transform sequence, n is the data index within a transform sequence and N is the size of the transform sequence. We assume that the input data enter the preprocessing stage sequentially and continuously, starting with xi(0).
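The following behavioural sketch (ours; it mirrors the timing of Table 2 but does not reproduce the exact line A–D wiring or the buffer sizes of Fig. 3, and the generator name is an assumption) shows that one word in and at most one word out per cycle, with a single adder/subtractor, is sufficient even when blocks arrive back to back.

```python
from collections import deque

def preprocessing_stage(stream, N):
    """Behavioural model of the first preprocessing stage.  For each block of N
    samples it yields the N/2 sums x(n)+x(N-1-n) (n = N/2-1 .. 0) while the
    block's second half is arriving, then the N/2 differences x(n)-x(N-1-n)
    while the next block's first half is arriving (cf. Table 2)."""
    rev = deque()    # first half of the current block, most recent sample first
    keep = deque()   # (first-half, second-half) word pairs awaiting subtraction
    for cycle, x_in in enumerate(stream):
        if (cycle % N) < N // 2:         # first half of a block is arriving
            rev.appendleft(x_in)
            if keep:                     # difference phase of the previous block
                c, b = keep.popleft()
                yield cycle, c - b
        else:                            # second half of the block is arriving
            c = rev.popleft()            # reversed read-out: x(N/2-1), x(N/2-2), ...
            yield cycle, c + x_in        # sum phase
            keep.append((c, x_in))       # remember both words for the differences

# Two consecutive 8-point blocks: outputs start at cycle 4 and continue without a pause.
blocks = list(range(8)) + list(range(10, 18))
for cycle, value in preprocessing_stage(blocks, 8):
    print(cycle, value)
```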
For the initial stage of every preprocessing stage, there is a latency of N/2 word cycles before the first computation (x1(3)8 + x1(4)8, for example) begins. After the initialization, the same process repeats every N/2 cycles. In this design, regardless of the size of the transform, only one adder/subtractor is used at each stage. Although three control signals are shown in Fig. 4, only two are necessary: the control inputs Mux_1 and Add have the same status when "+" represents "1" and "−" represents "0". In our approach, the same data used for the computation of x1(i)N + x1(i+N/2)N, i = 0, ..., N/2 − 1, are also available for the computation of x1(i)N − x1(i+N/2)N, i = 0, ..., N/2 − 1 (see Table 1, for example). Most reported work computes x1(i)N + x1(i+N/2)N and x1(i)N − x1(i+N/2)N at the same time (e.g. [2]). For sequentially presented input data, this approach generally requires one adder and one subtractor operating at the same time. Such an arrangement provides more processing throughput than necessary, and extra memory is needed to store the partially processed signals. The same operations can easily be extended to the other stages. In general, the latency of the preprocessing module at stage i is N/2^i, and the bi-directional and uni-directional buffers at that stage each hold N/2^i words.

4.2. 2-point DCT module

The realization of the 2-point DCT is shown in Fig. 5. It consists of two one-word delay elements (M1 and M2) for temporary data storage, a 2:1 multiplexer selecting the pair of data to be processed, and an adder/subtractor together with a multiplier to compute the outputs. This module receives its inputs from the previous stage and produces the sum or the difference of the inputs depending on the switch position. This module has a latency of two word cycles due to the temporary data storage.
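A functional sketch of the 2-point DCT module (ours; the two-word register latency of M1/M2 is not modelled and the generator name is an assumption): each pair arrives high index first, as delivered by the preprocessing stage, and the module emits the sum followed by the difference scaled by cos(π/4) ≈ 0.707, as in Table 3.

```python
import math

def two_point_dct_stream(stream):
    """For every incoming pair (v1 = v(1) first, then v0 = v(0)) emit
    v0 + v1 and then (v0 - v1) * cos(pi/4).  An even-length stream is assumed."""
    it = iter(stream)
    for v1 in it:
        v0 = next(it)
        yield v0 + v1
        yield (v0 - v1) * math.cos(math.pi / 4)

# Feeding the pair (x1ee(1), x1ee(0)) yields x1ee(0)+x1ee(1) and then
# 0.707*(x1ee(0)-x1ee(1)), matching the Out_DCT2 column of Table 3.
```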
Table 3
Timing diagram for the 2-point DCT

Clock | Mux_4pts xijk(n)N | M1       | M2       | Sel_Coeff | Out_DCT2
8     | x1ee(1)2          | —        | —        | —         | —
9     | x1ee(0)2          | x1ee(1)2 | —        | 1a        | —
10    | x1eo(1)2          | x1ee(0)2 | x1ee(1)2 | 2a        | x1ee(0)2 + x1ee(1)2
11    | x1eo(0)2          | x1eo(1)2 | x1ee(0)2 | 1a        | (x1ee(0)2 − x1ee(1)2) × 0.707
12    | x1oe(1)2          | x1eo(0)2 | x1eo(1)2 | 2a        | x1eo(0)2 + x1eo(1)2
13    | x1oe(0)2          | x1oe(1)2 | x1eo(0)2 | 1a        | (x1eo(0)2 − x1eo(1)2) × 0.707
14    | x1oo(1)2          | x1oe(0)2 | x1oe(1)2 | 2a        | x1oe(0)2 + x1oe(1)2
15    | x1oo(0)2          | x1oo(1)2 | x1oe(0)2 | 1a        | (x1oe(0)2 − x1oe(1)2) × 0.707
16    | x2ee(1)2          | x1oo(0)2 | x1oo(1)2 | 2a        | x1oo(0)2 + x1oo(1)2
17    | x2ee(0)2          | x2ee(1)2 | x1oo(0)2 | 1a        | (x1oo(0)2 − x1oo(1)2) × 0.707
Fig. 6. Postaddition stage.
One control signal, Sel_Coeff, is required to control the switch as well as the selection between addition and subtraction. Table 3 shows the details of the operations and the timing of the 2-point DCT module. It shows that the twiddle factor 0.707 multiplies the differences of the input data.

4.3. Shuffle N-k and postaddition stage

The postprocessing stage, shown in Fig. 2, produces the odd indexed outputs of the DCT according to (4). It consists of r − 1 stages of subtraction and data shuffling. Fig. 6 shows the implementation of the subtraction module that recursively generates the odd indexed outputs. It consists of one multiplier and an accumulator. A Start signal is used to clear the one-word buffer M4, which is used as temporary data storage.
Table 4
Timing for the postprocessing stage (N = 2)

Clock | Out_DCT2 Fijk(k)N | M4       | Start | Out_2pts Xi(k)N
12    | F1eo(0)2          | Clear    | 1     | X1eo(0)2 = 0.5F1eo(0)2
13    | F1eo(1)2          | X1eo(0)2 | 0     | X1eo(1)2 = F1eo(1)2 − X1eo(0)2
14    | —                 | X1eo(1)2 | 0     | —
15    | —                 | —        | —     | —
16    | F1oo(0)2          | Clear    | 1     | X1oo(0)2 = 0.5F1oo(0)2
17    | F1oo(1)2          | X1oo(0)2 | 0     | X1oo(1)2 = F1oo(1)2 − X1oo(0)2
18    | —                 | X1oo(1)2 | 0     | —
Table 5
Timing for the postprocessing stage (N = 4)

Clock | Out_4-k Fi(k)N | M4      | Start | Out_4pts Xi(k)N
15    | F1o(0)4        | Clear   | 1     | X1o(0)4 = 0.5F1o(0)4
16    | F1o(1)4        | X1o(0)4 | 0     | X1o(1)4 = F1o(1)4 − X1o(0)4
17    | F1o(2)4        | X1o(1)4 | 0     | X1o(2)4 = F1o(2)4 − X1o(1)4
18    | F1o(3)4        | X1o(2)4 | 0     | X1o(3)4 = F1o(3)4 − X1o(2)4
19    | —              | X1o(3)4 | 0     | —
Since M4 is cleared with the Start signal, the adder/subtractor is always in the computation path, eliminating the need to activate or deactivate it via a switch. The operation details and the required timing are given in Tables 4 and 5 for N = 2 and 4, respectively. It is assumed that the input sequence is in natural order.
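A minimal sketch of the accumulator of Fig. 6 (ours; signal names are assumptions): M4 holds the previously produced odd output and is cleared by Start, and the 0.5/1 multiplicand selects between the two lines of (4), reproducing the Out columns of Tables 4 and 5.

```python
def postaddition(F_stream, starts):
    """Yield X values: X = 0.5*F when Start is asserted (M4 just cleared),
    otherwise X = F - M4, where M4 holds the previous output (Eq. (4))."""
    m4 = 0.0
    for f, start in zip(F_stream, starts):
        if start:
            m4 = 0.0                          # Start clears the one-word buffer M4
        x = (0.5 if start else 1.0) * f - m4
        m4 = x                                # M4 now holds the newest odd output
        yield x

# Table 5 pattern: F1o(0..3) with Start = 1,0,0,0 gives
# 0.5*F1o(0), F1o(1)-X(1), F1o(2)-X(3), F1o(3)-X(5).
print(list(postaddition([8.0, 6.0, 4.0, 2.0], [1, 0, 0, 0])))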
Fig. 7. Structural design of the Shuffle N-k stage for one element.
The multiplication in Fig. 6 is trivial because the multiplicand is 1 or 0.5, which can be accomplished by a shifting operation. This module introduces no additional latency apart from the shift and subtraction operations. Because the outputs of (2) and (3) are not in natural order, the basic module shown in Fig. 7 is cascaded to shuffle the received sequence into natural order, so as to support the operation performed by the circuit of Fig. 6. This basic module consists of one delay buffer and two switches, which are implemented with 2:1 multiplexers. At the start of a new sequence, the switch is connected to position 1a: the delay buffer M3 is filled with one datum from the input, and the output is taken from M3. When the switch is connected to position 2a, the buffer M3 holds the last stored datum and the input datum is directed to the output Out_N-k.
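The element of Fig. 7 can be sketched as follows (ours; the select-signal encoding is taken from Table 6 and the generator name is an assumption): with select "1a" the element emits M3's previous content and loads the input into M3, while with "2a" the input passes straight through.

```python
def shuffle_element(inputs, selects):
    """One Shuffle_N-k element: a single delay word M3 plus a switch."""
    m3 = None
    for value, sel in zip(inputs, selects):
        if sel == "1a":
            yield m3          # emit the previously stored word (None until filled)
            m3 = value        # and store the new one
        else:                 # sel == "2a": pass-through, M3 unchanged
            yield value

# Shuffle 4-1 example from Table 6: inputs X1e(0), X1e(2), X1e(1), X1e(3)
# with selects 1a, 1a, 2a, 1a come out as None, X1e(0), X1e(1), X1e(2).
print(list(shuffle_element(["X1e(0)", "X1e(2)", "X1e(1)", "X1e(3)"],
                           ["1a", "1a", "2a", "1a"])))
```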
Table 6
Timing for the proposed Shuffle 4-1 stage

Clock | In_N-k (from Out_2pts) |         | Sel_N-k (Sel_4-1) | M3      | Out_N-k Xi(k)N
10    | X1ee(0)2               | X1e(0)4 | 1a                | —       | —
11    | X1ee(1)2               | X1e(2)4 | 1a                | X1e(0)4 | X1e(0)4
12    | X1eo(0)2               | X1e(1)4 | 2a                | X1e(2)4 | X1e(1)4
13    | X1eo(1)2               | X1e(3)4 | 1a                | X1e(2)4 | X1e(2)4
14    | X1oe(0)2               | F1o(0)4 | 1a                | X1e(3)4 | X1e(3)4
15    | X1oe(1)2               | F1o(2)4 | 1a                | F1o(0)4 | F1o(0)4
16    | X1oo(0)2               | F1o(1)4 | 2a                | F1o(2)4 | F1o(1)4
17    | X1oo(1)2               | F1o(3)4 | 1a                | F1o(2)4 | F1o(2)4
18    | X2ee(0)2               | X2e(0)4 | 1a                | F1o(3)4 | F1o(3)4
This function introduces a latency of one word cycle for every element that needs to be shuffled. For each stage there are N/2 − 1 elements to be re-shuffled, so the operation to obtain the natural sequence introduces a latency of N/2 − 1 cycles. We denote this operation by Shuffle N-k. Tables 6 and 7 show the timing of the shuffling operations for N = 4 and 8, respectively. Three word cycles are introduced to obtain the natural sequence order for N = 8.

4.4. Comparison and discussion

This section summarizes the proposed hardware structure. A comparison with other reported work is also made in terms of several criteria.
Table 7
Timing for the cascaded Shuffle 8-3, 8-2 and 8-1 stages

Clock | Out_4pts Xi(k)N |        | Sel_8-3 | Out_8-3 | Sel_8-2 | Out_8-2 | Sel_8-1 | Out_8-1
11    | X1e(0)4         | X1(0)8 | 1a      |         |         |         |         |
12    | X1e(1)4         | X1(2)8 | 1a      | X1(0)8  |         |         |         |
13    | X1e(2)4         | X1(4)8 | 1a      | X1(2)8  | 1a      | X1(0)8  |         |
14    | X1e(3)4         | X1(6)8 | 2a      | X1(4)8  | 1a      | X1(2)8  | 1a      | X1(0)8
15    | X1o(0)4         | X1(1)8 | 2a      | X1(1)8  | 2a      | X1(1)8  | 2a      | X1(1)8
16    | X1o(1)4         | X1(3)8 | 2a      | X1(3)8  | 2a      | X1(3)8  | 1a      | X1(2)8
17    | X1o(2)4         | X1(5)8 | 1a      | X1(5)8  | 1a      | X1(4)8  | 1a      | X1(3)8
18    | X1o(3)4         | X1(7)8 | 1a      | X1(6)8  | 1a      | X1(5)8  | 1a      | X1(4)8
19    |                 |        |         | X1(7)8  | 1a      | X1(6)8  | 1a      | X1(5)8
20    |                 |        |         |         | 1a      | X1(7)8  | 1a      | X1(6)8
21    |                 |        |         |         |         |         | 1a      | X1(7)8
Table 8
Comparisons among various modules

Modules                            | System blocks (Fig. 2) | Preprocessing (Figs. 3 and 4) | 2-pts DCT (Fig. 5) | Postaddition (Fig. 6) | Shuffle N-k (Fig. 7)
Latency                            | log2 N − 1             | N − 2                         | 2                  | 0                     | N − 1 − log2 N
Multipliers (excluding 0.5 factor) | log2 N − 1             | 0                             | 1                  | 0                     | 0
(A) Buffering                      | log2 N − 1             | 0                             | 1                  | 0                     | 0
(B) Timing elements                | 0                      | 2N − 4                        | 2                  | 0                     | N − 1 − log2 N
(C) Temp. storage                  | 0                      | 0                             | 0                  | log2 N − 1            | 0
Adders/Subtractors                 | 0                      | log2 N − 1                    | 1                  | log2 N − 1            | 0
Shifters                           | 0                      | 0                             | 0                  | log2 N − 1            | 0
Switching elements                 | 4(log2 N − 1)          | 4(log2 N − 1)                 | 1                  | 0                     | 2(N − 1 − log2 N)
Based on Table 8, which lists the numbers of operators of each type used in the proposed hardware structure, the following summary can be made:
Latency = 2N − 2 word cycles;
No. of multipliers = log2 N;
No. of word delay elements (A + B + C) = 3N + log2 N − 4;
No. of adders/subtractors = 2 log2 N − 2;
No. of multiplexers and demultiplexers = 2N + 6 log2 N − 9;
No. of shifters = log2 N − 1.

Table 9 compares the proposed structure with the others reported in [2,3,7]. Note that the processing throughput is used as the reference for the comparison. All the listed 1D DCT processing systems achieve the same processing throughput, that is, one complete N-point transform is computed within N word cycles. It is impossible to achieve a processing throughput that is higher than the input throughput of the processing system. For sequentially presented input data, the processing is accomplished sequentially as each input datum arrives, in contrast to parallel processing, which first collects the entire input sequence. Table 9 shows that the systems reported in [3,7] need a number of multipliers of order O(N), whereas the system reported in [2] and the proposed one require only log2 N multipliers, which significantly reduces the hardware complexity.
Table 9
Comparison of architectures for sequential implementation of the 1D N-point DCT

Parameters                        | Proposed              | Structure in [2]      | Structure in [3]      | Structure in [7]
Throughput (transforms per cycle) | 1/N                   | 1/N                   | 1/N                   | 1/N
I/O form                          | Serial-in, serial-out | Serial-in, serial-out | Serial-in, serial-out | Serial-in, serial-out
Input sequence to chip            | Natural               | Even–odd              | Natural               | Natural
Output sequence from chip         | Natural               | Natural               | Natural               | Natural
Number of multipliers             | log2 N                | log2 N                | 2N                    | 2N
Number of delay elements          | 3N + log2 N − 4       | 6N + 3 log2 N         | 4N                    | 4N
Latency                           | 2N − 2                | 2N + 3 log2 N − 1     | 2N − 1                | 3N − 1
Switching elements                | 2N + 6 log2 N − 7     | 4N + 7 log2 N         | 0                     | 4N
Number of adders                  | 2 log2 N − 1          | 3 log2 N              | 2N                    | 2N
In addition, the hardware algorithm in [2] needs to rearrange the data sequence into an even–odd order before processing starts, which inevitably requires additional memory space and processing latency. In contrast, the proposed structure deals with the sequential input data more effectively by means of the two buffers, which minimizes both the amount of data storage and the processing latency. It can be clearly seen in Table 9 that the proposed structure also achieves savings in multipliers, switching elements and adders/subtractors. In addition, the proposed structure has a minimum processing latency, which is often desired in practical applications.

5. Conclusion

Based on a recursive fast algorithm, a hardware structure is proposed to directly interface with a sequentially presented data stream. Compared to other similar hardware implementations, the proposed one reduces the number of multipliers, adders/subtractors and delay elements. In addition, better utilization of the hardware and a lower processing latency are achieved. The proposed structure and the required functional
modules can be easily extended to transforms of any other length that is a power of two.

References

[1] S.C. Chan, K.L. Ho, A new two-dimensional fast cosine transform algorithm, IEEE Trans. Signal Process. 39 (2) (February 1991) 481–485.
[2] Y.T. Chang, C.L. Wang, A new fast DCT algorithm and its systolic VLSI implementation, IEEE Trans. Circuits Systems II: Analog Digital Signal Process. 44 (11) (November 1997) 959–962.
[3] L.W. Chang, M.C. Wu, A unified systolic array for discrete cosine and sine transforms, IEEE Trans. Signal Process. 39 (1) (January 1991) 192–194.
[4] Guoan Bi, L.W. Yu, DCT algorithms for composite sequence lengths, IEEE Trans. Signal Process. 46 (3) (March 1998) 554–562.
[5] H.S. Hou, A fast recursive algorithm for computing the discrete cosine transform, IEEE Trans. Acoust. Speech Signal Process. ASSP-35 (10) (October 1987) 1455–1461.
[6] K.R. Rao, P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Boston, 1990.
[7] C.L. Wang, C.Y. Chen, High-throughput VLSI architectures for the 1D and 2D discrete cosine transforms, IEEE Trans. Circuits Systems Video Technol. 5 (2) (February 1995) 31–40.