A high performance MQ encoder architecture in JPEG2000


INTEGRATION, the VLSI journal 43 (2010) 305–317

Contents lists available at ScienceDirect

Journal homepage: www.elsevier.com/locate/vlsi

Kai Liu a,*, Yu Zhou b, Yun Song Li b, Jian Feng Ma a

a School of Computer Science and Technology, Xidian University, Xi’an 710071, China
b National Key Laboratory of Integrated Service Networks, Xidian University, Xi’an 710071, China

Article info

Article history: Received 2 April 2008; received in revised form 8 January 2010; accepted 8 January 2010

Keywords: MQ encoder; Process elements

Abstract

In this paper, a novel architecture for an MQ arithmetic coder with high throughput is proposed. The architecture can process two symbols in parallel. The main characteristics are eight process elements for the prediction of probability interval A, the combination of calculation units for the code register C with the Byteout&Flush procedure, and the use of a dedicated probability estimation table to decrease the internal memory. From FPGA synthesis results, the architecture’s throughput can reach 96.60 M context symbols per second with an internal memory size of 1509 bits, which is comparable to that of other architectures and suitable for chip implementation. © 2010 Elsevier B.V. All rights reserved.

1. Introduction

High performance arithmetic entropy coders have always been of interest to hardware researchers in the field of image compression. Among them, the MQ encoder of the JPEG2000 [1] standard is an important bottleneck for real-time applications. In JPEG2000, after the discrete wavelet transform (DWT), the wavelet coefficients of each code-block are scanned in a particular order for quantization and entropy coding, i.e., the MQ encoding. The inputs of the MQ encoder are context-decision (CXD) pairs, whose number is proportional to the number of bit-planes that have been coded. The throughput required of the MQ encoder is therefore set by the number of CXD pairs generated by the EBCOT [2,3] context modeling. To meet the real-time requirement, a high speed MQ encoder architecture must be designed carefully.

Gupta’s MQ encoder [4] can process 1.2 CXD pairs per clock cycle on average, and 10 CXD pairs per clock cycle at its maximum. The variable input rate makes data buffering difficult and requires multiple clock domains in connection with each bit-plane coder. In [5,6], the MQ encoders can process only one CXD pair per clock cycle. The MQ encoder in [7] can deal with two CXD pairs per clock cycle. However, it stalls when the two contexts are the same, so it cannot sustain two CXD pairs per clock cycle. The most efficient arithmetic coder probably comes from [8], in which Dyer proposed five different architectures for high-speed MQ encoders. Among these architectures, the brute force architecture can always code two CXD pairs per clock cycle.

* Corresponding author. Tel.: +86 29 88203110. E-mail address: [email protected] (K. Liu).

0167-9260/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2010.01.001

In this paper, we study the principle of the MQ algorithm used in JPEG2000 in detail and design our own high performance MQ encoder, which can consistently process two CXD pairs in one clock cycle with reduced memory bits. In our architecture, we use eight different processing elements (PEs) to predict the probability interval, and we merge the byte-out and bit-stuffing processes to obtain a fully pipelined design. Another highlight of our design is the simple structure of the probability estimation table (PET ROM). With a small PET ROM, the memory bits used in the architecture can be reduced significantly.

The rest of the paper is organized as follows. Section 2 provides an overview of the MQ algorithm in JPEG2000 and shows the challenge for high speed hardware implementation. In Section 3, we describe our proposed parallel MQ encoder in detail. It covers the calculation of the probability interval and the code value register, the PET ROM design, the analysis of BYTEOUT for the code stream, and the timing of the processing pipeline. Experimental results and comparisons with other designs are given in Section 4. Section 5 provides a brief summary.

2. Background and motivation

2.1. Principle

The MQ encoder used in JPEG2000 is based on the QM adaptive arithmetic encoder in JBIG, but uses the byte emission technique of the Q coder [9]. A simple block diagram of the binary adaptive arithmetic encoder is shown in Fig. 1. The decision (D) and context (CX) pairs formed by the EBCOT processing are the inputs of the MQ encoder [10], which produces the compressed image data (CD) as output. The CX is a context label with values ranging from 0 to 18.

Fig. 1. Arithmetic encoder inputs and outputs. [Block diagram: the inputs D and CX enter the Encoder, which outputs CD.]

Fig. 2. Calculations for the probability interval A and the code register C.

Table 1 The encoder register structures.

Register    MSB                                        LSB
C-register  0000 cbbb   bbbb bsss   xxxx xxxx   xxxx xxxx
A-register  0000 0000   0000 0000   1aaa aaaa   aaaa aaaa

Table 2 The maximum number of contexts for a code-block of typical images.

Image               Airport  Bridge  Los_angeles2  Field  Lena
Maximum contexts    32535    47887   49573         48634  47874
Contexts per pixel  7.94     11.69   12.10         11.87  11.68

The D is a bit with value 0 or 1. The MQ encoder selects the probability estimate Qe of the less probable symbol (LPS) according to the current CXD pair. With each binary decision, the current probability interval A is subdivided into two sub-intervals (one for the most probable symbol, MPS, and the other for the LPS), and the code string C is modified (if necessary) so that it points to the base (the lower bound) of the probability sub-interval assigned to the symbol. A precise calculation of the sub-interval is shown in Fig. 2. A 16-bit unsigned value can represent the probability interval A during calculation. The code string register C is a 32-bit unsigned integer with different fields, as shown in Table 1. The LPS probabilities Qe are stored in an array called the probability estimation table (PET) with 47 entries. The estimation process can be defined as a finite-state machine with 47 states. The index of each PET entry is obtained from the CX.

When the interval A is less than 0x8000 (0.75 in decimal notation) after coding a symbol, renormalization is performed to keep the interval within its working range. The renormalization procedure shifts the interval register A and the code register C left one bit at a time until A is no longer less than 0x8000. The register CT counts these shifts: it is decremented on every shift, and when it reaches zero, a byte of compressed image data is output from a fixed position of the C register. In order to limit carry propagation into the compressed code stream, a bit-stuffing routine is applied. The A and C registers therefore play an important role during the whole process, and their design deserves considerable attention in a hardware implementation.

2.2. Challenge

The biggest challenge in MQ hardware implementation is the throughput bottleneck. In the normal mode (without parallel coding passes), the EBCOT produces a large number of CXD pairs, in proportion to the number of bit-planes coded, and the MQ encoder must process all CXD pairs serially. Table 2 gives the number of CXD pairs in a code-block for natural images (block size is 64 × 64); for Lena, for example, the maximum is 47874 contexts, or 11.68 contexts per pixel. In Table 2, we can see that there are nearly thirteen CXD pairs for a single pixel. If the MQ encoder works at a speed of one CXD pair per clock cycle, and the clock frequency of the EBCOT module is f, we must use thirteen MQ encoders in parallel or increase the clock frequency of the MQ encoder to 13f for real-time processing. Otherwise, we must improve the throughput of the MQ encoder to alleviate the resource and timing constraints.

The key issue in improving the speed of the MQ encoder is how to calculate A and C in advance and resolve the dependency between them. In addition, the number of bytes to be emitted must be decided when multiple CXD pairs are processed simultaneously. In this paper, we focus on processing two CXD pairs in parallel only.

According to the MQ algorithm, the code register C depends on the current value of A and on renormalization. The BYTEOUT begins when CT reaches zero. The initial CT is 12 or 13, depending on the last byte value in the code stream. After one BYTEOUT procedure, CT is 7 if the last byte is FFH; otherwise CT is 8. The interval A can shift left at most 15 times during renormalization. This means that the CT can count down to zero twice within one renormalization, so the number of bytes removed from the code string can be 0, 1, or 2 in one renormalization. Therefore, when two CXD pairs are processed in parallel, two renormalizations may occur, which leads to the emission of at most four bytes. In order to achieve two CXD pairs per clock cycle, we must predict the real A and C values for the second CXD pair accurately and distinguish the different kinds of BYTEOUT cases.
Then, we must select the correct bits from the code string register C to form the code stream. In the next section, we will give the details of our dual symbol MQ architecture.
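The byte counts discussed in this section (0, 1, or 2 bytes per renormalization, hence at most four bytes for two CXD pairs) can be checked against a sequential software model. The following Python sketch is not the proposed hardware. It follows the usual sequential CODEMPS/CODELPS, RENORME, and BYTEOUT flow as we understand it, leaves out the context state machine and the final flush, and takes Qe and the MPS/LPS flag directly as inputs. The class name MQCore, its method names, and the dummy first byte are assumptions made for illustration.

```python
class MQCore:
    """Sequential A/C/CT datapath of an MQ encoder (illustrative model)."""

    def __init__(self):
        self.a = 0x8000           # probability interval A
        self.c = 0                # code register C
        self.ct = 12              # shifts left before the next BYTEOUT
        self.out = bytearray(1)   # out[0] is a dummy byte preceding the stream

    def _byteout(self):
        if self.out[-1] != 0xFF and self.c >= 0x8000000:
            self.out[-1] += 1     # propagate the carry into the previous byte
            self.c &= 0x7FFFFFF
        if self.out[-1] == 0xFF:  # bit stuffing: the next byte carries 7 bits
            self.out.append(self.c >> 20)
            self.c &= 0xFFFFF
            self.ct = 7
        else:
            self.out.append(self.c >> 19)
            self.c &= 0x7FFFF
            self.ct = 8

    def encode(self, qe, is_mps):
        """Code one decision and return how many bytes it emitted."""
        start = len(self.out)
        self.a -= qe
        if is_mps and self.a >= 0x8000:
            self.c += qe          # MPS, no renormalization needed
            return 0
        if is_mps:                # conditional exchange for an MPS
            if self.a < qe:
                self.a = qe
            else:
                self.c += qe
        else:                     # conditional exchange for an LPS
            if self.a < qe:
                self.c += qe
            else:
                self.a = qe
        while self.a < 0x8000:    # RENORME: one left shift per iteration
            self.a <<= 1
            self.c <<= 1
            self.ct -= 1
            if self.ct == 0:
                self._byteout()
        return len(self.out) - start


core = MQCore()
# Three consecutive LPS decisions with the smallest Qe force 15 shifts each.
counts = [core.encode(0x0001, False) for _ in range(3)]
print(counts)
```

In this model, two adjacent decisions that each emit two bytes correspond to the four-byte case that the dual symbol architecture has to handle within one clock cycle.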

3. Architecture of the dual CXD pairs MQ encoder

3.1. Overview

The whole architecture of our proposed MQ encoder is shown in Fig. 3. In order to process two CXD pairs in parallel, the encoder uses four coding stages. The first stage predicts the probability interval A; the second stage calculates the code string register C; the third stage buffers the code stream in the corresponding storage; and the last stage outputs the final code stream.

Fig. 3. Architecture of the proposed MQ encoder. [Block diagram arranged in Stages 0–3. Blocks shown: INDEXRAM with its address logic, PET ROM, processing elements Pe0–Pe7 (LpsMps, LpsLps, MpsMps, and MpsLps, each in a "Diff." and a "Same" variant) with the enable signals LpsMps_En, LpsLps_En, MpsMps_En, and MpsLps_En, MUX, Leading Zero Detect, RenormeCase, C_ValueCase, RenormeOnce, RenormeTwice0, RenormeTwice1, C_reg, ByteOutMux (ByteOut0–ByteOut3), Flush, FIFO0–FIFO3, and ByteOutRead, which delivers the compressed code.]

The inputs of the encoder are two CXD pairs, identified as (CX0, D0) and (CX1, D1), which are supplied to the encoder simultaneously. The context labels CX0 and CX1 are used as addresses for the dual-port RAM INDEXRAM to read the corresponding indexes and MPS decisions (i.e., MPS_CX0, MPS_CX1). According to the MQ algorithm, the context label is an integer ranging from 0 to 18, the PET index ranges from 0 to 46, and the MPS decision of a context is one bit. So the size of INDEXRAM is 19 × 7 bits. In every word of INDEXRAM, the top 6 bits hold the index, and the LSB holds the MPS decision. The indexes read from the INDEXRAM are supplied to the PEs over an internal bus for the prediction of A. The MPS_CX0 and MPS_CX1 are combined with (D0, D1) through logic gates to decide the type of the two symbols. There are only four types of two CXD pairs: MPSMPS, MPSLPS, LPSMPS, and LPSLPS. The corresponding signals (MpsMps_En, MpsLps_En, LpsMps_En, LpsLps_En) enable the correct PE to make the A prediction through the internal bus. At the same time, CX0 is XORed with CX1 to indicate whether the two contexts are identical. The resulting signal (DIFF) is connected to each of the PEs.

There are eight process elements used for the prediction of A and the computation of the shift numbers of A with renormalization, i.e., NumSLA0 and NumSLA1. The NumSLA0 is the number of left shifts of A incurred by CX0, and the NumSLA1 is the corresponding value caused by CX1. During the prediction of the interval A, the LPS probability value Qe and the auxiliary information are read from the probability estimation table ROM (PETROM) through the internal bus. After the selected PE has completed its prediction, the interval A is stored in a register. The NumSLA0, NumSLA1, the new index for the second context IDX1, the new probability value Qe1 for the second LPS, and the tag of renormalization are all pipelined to the second stage for the calculation of the code string C.

After the probability interval A is computed, the second stage judges the type of the current renormalization based on the two shift numbers of A, i.e., NumSLA0 and NumSLA1, and the shift counter of C, that is, CT, as the RenormeCase module in Fig. 3 shows. Because at most two renormalizations can occur when two CXD pairs arrive, there are three renormalization modules for all possible cases. The RenormeTwice0 and RenormeTwice1 are responsible for two renormalizations. Each case leads to a different C value calculation. For the single renormalization case, the MQ encoder uses the RenormeOnce module to deal with the CXD pairs. At the same time, the C_ValueCase module provides the different switch signals for the C value update. If the current symbol is an MPS, then the C value is increased by the corresponding probability Qe. Otherwise, C is increased by zero and remains unchanged. The updated C is supplied to the different renormalization units. The shift numbers NumSLA0 and NumSLA1 decide the number of left shifts of the C register. Then, the active renormalization unit emits 0, 1, 2, 3, or 4 bytes from the C register to the code stream and starts the bit-stuffing process if the last byte emitted is FFH.

During the third stage, the ByteOutMux module selects the compressed bytes to be written to the code FIFOs. Because at most four bytes are emitted, four FIFOs are used to buffer these code bytes. When the last CXD pairs arrive, the Flush module outputs the remaining bits in the C register. In the final stage, the ByteOutRead module reads FIFO0, FIFO1, FIFO2, FIFO3, and the Flush module to output the code bytes serially. In the following sections, we describe the detailed structure of the A and C registers.

3.2. Prediction of the probability interval A

When two CXD pairs arrive, the different types of input symbols result in different ways to calculate the A register. There are four types of input symbols with two CXD pairs: both symbols are the most probable symbol, denoted as MPSMPS; the first symbol is the most probable symbol and the second is the less probable symbol, denoted as MPSLPS; the first symbol is the less probable symbol and the second is the most probable symbol, denoted as LPSMPS; and both symbols are the less probable symbol, denoted as LPSLPS. The detailed structures of the four types are illustrated in Fig. 4. In Fig. 5, the flow charts for the different conditions are drawn. To keep the control simple, these parts are implemented by separate circuits, and there is no resource sharing among them. Some circuits could be shared by different cases, however, at the cost of an additional control circuit to distinguish them.

3.2.1. MPSMPS case

In Fig. 4(a), two MPS symbols are processed in parallel. When the two context labels are not the same, the DIFF signal is set to 1. Qe0 and Qe1 are the probability values for the two different indexes, respectively, from the PETROM memory. If the two context labels are the same, the DIFF signal is 0.
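The front-end decisions described so far (splitting an INDEXRAM word, choosing one of the four enable signals, and forming DIFF) can be sketched in a few lines of Python. The function names are hypothetical, and the sketch deliberately ignores one detail: when both pairs share a context, the MPS sense seen by the second symbol may already have been changed by the first one.

```python
def unpack_indexram_word(word):
    """Split a 7-bit INDEXRAM word into (PET index, MPS decision)."""
    return (word >> 1) & 0x3F, word & 1


def classify_pair(d0, mps0, d1, mps1):
    """Return the name of the enable signal for the two decisions."""
    first = "Mps" if d0 == mps0 else "Lps"
    second = "Mps" if d1 == mps1 else "Lps"
    return first + second + "_En"


def diff_signal(cx0, cx1):
    """DIFF is 1 when the two context labels differ (XOR is non-zero)."""
    return 1 if (cx0 ^ cx1) else 0


index, mps = unpack_indexram_word(0b0001101)
print(index, mps, classify_pair(1, mps, 0, 1), diff_signal(3, 17))
```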
Because the index of the first context may change once the first symbol has been coded, the probability value for the second symbol is then the Qe value of the next MPS index, denoted as NT_NMPS_Qe1, which is read from the PETROM entry addressed by the first index. The two candidate Qe1 values are selected by the DIFF signal through a MUX.

3.2.2. MPSLPS case

In Fig. 4(b), one MPS symbol followed by an LPS symbol is processed for the A value. Because the first symbol is the same as in the MPSMPS case, the Qe0, Qe1, and NT_NMPS_Qe1 values have the same meaning as in the MPSMPS case. The difference lies in the decision logic for the LPS symbol.

3.2.3. LPSMPS case

In Fig. 4(c), the structure that calculates the A value for the LPSMPS symbols is drawn. When the first symbol is an LPS, the probability value Qe of the second (MPS) symbol comes from the PETROM memory and is shown as NT_NLPS_Qe1. In addition, the DIFF signal routes the correct value to the internal unit for the prediction of A.

3.2.4. LPSLPS case

This case is similar to the LPSMPS case above except for some differences in the conditions, as shown in Fig. 4(d).

At this point, we have shown all possible cases for the interval A and their hardware structures. In the structures, NT_NLPS_Qe1 and NT_NMPS_Qe1 are the predicted values for the second symbol after an LPS or an MPS, respectively, and are parts of the PETROM content. LZ0 in the PETROM stands for the number of leading zeros of Qe0. LZ_A is the shift number of the A value. SHIFT_NUM is the number of left shifts for the difference A − Qe0. The determination of LZ0, LZ_A, and SHIFT_NUM uses the method in [8] for speedup. As shown in the figures, the main differences are the conditions for the four input types.

3.3. Prediction of the code register C

The code register C is calculated as shown in Fig. 6. The NumSLA0 and NumSLA1 are the numbers of left shifts of the A value after the first symbol and the second symbol pass through the MQ coder. For the C register, a precursor value of C (C_PRE) is computed based on the values of Qe0, Qe1, NumSLA0, and NumSLA1. Then, the updated value of the C register is set based on the value of C_PRE for the different BYTEOUT types, which are no BYTEOUT, one BYTEOUT, and two BYTEOUTs. The CTM is the number of left shifts for C_PRE, where ‘M’ means multi-symbol. Depending on the combination of input symbols (MPSMPS, MPSLPS, LPSMPS, and LPSLPS), the signal C_ADD enables the increase of the C value by Qe1. The three different C values are sent to a flip-flop through a MUX. The Q outputs of the flip-flops hold the final C value. In this way, both the A and C values are computed within one clock cycle in our dual symbol architecture.

3.4. PETROM structure

To implement the dual symbol architecture, the PET table used in JPEG2000 must be changed. Because there are 47 indexes in the PET table for contexts, we use a ROM to hold an extended PET table. The depth of the PETROM should be 47, with 6 bits as the address wires. At each index, the PETROM provides not only the probability of the current LPS symbol, but also the information for the next probable symbol. In order to reduce the ROM size, we carefully check the content of the MQ-related data structures. For the symbol probability table Qe[47], each value of Qe is less than 0x8000, and the last two bits of Qe are constants, i.e., ‘‘01’’ in binary. A width of only 13 bits is therefore enough to store each value of Qe.
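The 13-bit storage of Qe can be illustrated with a small round-trip check. The sketch below takes the Qe values listed in Table 3, drops the constant low bits "01" for storage, and restores them on read-out. The function names pack_qe and unpack_qe are hypothetical, and the shift-based encoding is one straightforward way to realize the observation, not necessarily the wiring used in the actual PETROM.

```python
# Distinct Qe values as listed in Table 3.
QE_VALUES = [
    0x5601, 0x3401, 0x1801, 0x0AC1, 0x0521, 0x0221, 0x5401, 0x4801,
    0x3801, 0x3001, 0x2401, 0x1C01, 0x1601, 0x0009, 0x0005, 0x0001,
    0x5101, 0x2801, 0x2201, 0x1401, 0x1201, 0x1101, 0x09C1, 0x08A1,
    0x0441, 0x02A1, 0x0141, 0x0111, 0x0085, 0x0049, 0x0025, 0x0015,
]


def pack_qe(qe):
    """Drop the constant low bits '01'; the result fits in 13 bits."""
    assert qe < 0x8000 and (qe & 0b11) == 0b01
    return qe >> 2


def unpack_qe(stored):
    """Re-append the constant low bits '01'."""
    return (stored << 2) | 0b01


widest = max(pack_qe(q).bit_length() for q in QE_VALUES)
print(widest, all(unpack_qe(pack_qe(q)) == q for q in QE_VALUES))
```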
For the next MPS index table NMPS[47], we can use a function to realize the table as follows:

\[
\mathrm{NMPS}[x] =
\begin{cases}
x + 1, & x \in [0,4] \cup [6,12] \cup [14,44] \\
38, & x = 5 \\
29, & x = 13 \\
x, & x = 45, 46
\end{cases}
\tag{1}
\]

In addition, we can use another function for the next LPS index table NLPS[47], as follows:

\[
\mathrm{NLPS}[x] =
\begin{cases}
1, & x = 0 \\
29, & x = 4 \\
33, & x = 5 \\
x, & x = 6, 46 \\
3(x + 1), & x \in [1,3] \\
14, & x \in [7,9] \cup [14,15] \\
x + 7, & x \in [10,11] \\
x + 8, & x \in [12,13] \\
x - 1, & x \in [16,20] \\
x - 2, & x \in [21,45]
\end{cases}
\tag{2}
\]

For the switch bit table SWITCH[47], we use the following function:

\[
\mathrm{SWITCH}[x] =
\begin{cases}
1, & x = 0, 6, 14 \\
0, & \text{else}
\end{cases}
\tag{3}
\]
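Functions (1)–(3) translate directly into code, which also makes them easy to compare against a stored table. The following Python sketch implements the three functions and compares the NLPS function with a reference column. The reference list NLPS_REF was typed in by hand for this comparison and is an assumption of the sketch, not data taken from the PETROM.

```python
def nmps(x):
    """Eq. (1): next index after an MPS."""
    if x == 5:
        return 38
    if x == 13:
        return 29
    if x in (45, 46):
        return x
    return x + 1


def nlps(x):
    """Eq. (2): next index after an LPS."""
    if x == 0:
        return 1
    if x == 4:
        return 29
    if x == 5:
        return 33
    if x in (6, 46):
        return x
    if 1 <= x <= 3:
        return 3 * (x + 1)
    if 7 <= x <= 9 or 14 <= x <= 15:
        return 14
    if 10 <= x <= 11:
        return x + 7
    if 12 <= x <= 13:
        return x + 8
    if 16 <= x <= 20:
        return x - 1
    return x - 2              # x in [21, 45]


def switch(x):
    """Eq. (3): 1 for the states in which an LPS flips the MPS sense."""
    return 1 if x in (0, 6, 14) else 0


# Hand-typed reference NLPS column (assumption, used only for comparison).
NLPS_REF = [1, 6, 9, 12, 29, 33, 6, 14, 14, 14, 17, 18, 20, 21, 14, 14,
            15, 16, 17, 18, 19, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
            29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 46]

mismatches = [x for x in range(47) if nlps(x) != NLPS_REF[x]]
print(mismatches)
```

Because each function only compares the 6-bit index against a few constants and adds a small offset, it maps onto a handful of comparators and adders instead of ROM columns.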

Fig. 4. Four types of the interval A value prediction. (a) MPSMPS, (b) MPSLPS, (c) LPSMPS and (d) LPSLPS. [Circuit diagrams. Signals shown: DIFF, Qe0, Qe1, NT_NMPS_Qe1 (panels a and b), NT_NLPS_Qe1 (panels c and d), A, LZ0, LZ_A, and SHIFT_NUM, with a D flip-flop clocked by CLK at the output.]

Therefore, we do not need to store any bit of these three tables in the PETROM. In order to further decrease the size of the PETROM, we observe that some values of Qe are the same for different indexes. For example, Qe[0] = Qe[6] = Qe[14] = Qe[46] = 0x5601. We therefore count the number of occurrences of each distinct value in the table Qe[47]. Table 3 shows the detailed information for Qe[47]. In Table 3, there are only 32 different values in the Qe list. Therefore, the necessary size of Qe is only 32 × 13 bits.

The PETROM consists of three symbol probability tables and the leading zero bits. The three symbol probability tables are the current LPS symbol probability table, the probability table for the symbol following an LPS, and the probability table for the symbol following an MPS. The leading zero count of each Qe uses four bits. Therefore, the size of the PETROM is 32 × 13 × 3 + 32 × 4 = 32 × 43 = 1376 bits. Table 4 explains the meaning of each field in the PETROM. The first line in Table 4 shows the name of the field, and the second line is the number of bits used in that field.

Fig. 5. Flow charts of the interval A value prediction. (a) MPSMPS, (b) MPSLPS, (c) LPSMPS and (d) LPSLPS.

Table 3 The statistics of Qe table.

Qe      Occurrence count  Position
0x5601  4                 0, 6, 14, 46
0x3401  2                 1, 19
0x1801  2                 2, 25
0x0ac1  2                 3, 30
0x0521  2                 4, 33
0x0221  2                 5, 36
0x5401  2                 7, 15
0x4801  2                 8, 17
0x3801  2                 9, 18
0x3001  2                 10, 20
0x2401  2                 11, 22
0x1c01  2                 12, 24
0x1601  2                 13, 26
0x0009  1                 43
0x0005  1                 44
0x0001  1                 45
0x5101  1                 16
0x2801  1                 21
0x2201  1                 23
0x1401  1                 27
0x1201  1                 28
0x1101  1                 29
0x09c1  1                 31
0x08a1  1                 32
0x0441  1                 34
0x02a1  1                 35
0x0141  1                 37
0x0111  1                 38
0x0085  1                 39
0x0049  1                 40
0x0025  1                 41
0x0015  1                 42

Table 4 The structure of PETROM.

Field  ntNlpsQe  ntNmpsQe  Qe  lz
Bits   13        13        13  4

The fields stored in the PETROM are:

lz: the number of leading zero bits of the current LPS probability value Qe, 4 bits;
qe: the symbol probability value of the current LPS symbol, 13 bits;
ntNlpsQe: the probability value when the next symbol is an LPS, 13 bits;
ntNmpsQe: the probability value when the next symbol is an MPS, 13 bits.

For two symbols processed in parallel, some other information must be obtained immediately. The following fields are useful for our architecture:

swch: the switch bit of symbol probability for the current index;
nlps: the index number when the next symbol is an LPS;
nmps: the index number when the next symbol is an MPS;
ntNlpsSwch: the switch bit of symbol probability when the next symbol is an LPS;
ntNmpsSwch: the switch bit of symbol probability when the next symbol is an MPS;
nlpsNlpsCh: the index number of the third symbol when the next two symbols are both LPS;
nmpsNlpsCh: the index number of the third symbol when the next two symbols are MPS and LPS;
nlpsNmpsCh: the index number of the third symbol when the next two symbols are LPS and MPS;
nmpsNmpsCh: the index number of the third symbol when the next two symbols are both MPS.

Among these fields, nlpsNlpsCh, nmpsNlpsCh, nlpsNmpsCh, and nmpsNmpsCh are used to update the index when the two contexts are identical. As shown by the analysis above, all of these fields can be calculated from the index of the corresponding context by appropriate logic gates. For this reason, the size of the PETROM can be reduced significantly.

Fig. 6. Calculation of the C value. [Circuit diagram. Signals shown: C, Qe0, Qe1, C_PRE, C_PRE_SEL, C_ADD, CTM, NumSLA0, NumSLA1, BYTEOUT_TYPE, and CLK, with the constants 12 and 20.]

3.5. Estimations and conditions for the BYTEOUT

As two context symbols are being processed simultaneously, at most four bytes can be emitted at the same time. In order to define conditions for each BYTEOUT case, we must first find the number of renormalizations based on the probability interval A and the current probability value Qe of the LPS symbol. Then, from the number of renormalizations, we can derive the exact number of bytes emitted when two contexts are sent to the coder. Because a single context can incur at most one renormalization procedure, two contexts can incur four cases of renormalization: no renormalization, one renormalization for the first context and no renormalization for the second, no renormalization for the first and one renormalization for the second, and two renormalizations. For simplicity, we combine the second case with the third case and use RenormeZero, RenormeOne, RenormeTwo signals to stand for these cases. In Fig. 7, the detailed logic circuit is shown for these three case conditions when the different symbols arrive. There are several factors that can influence the number of renormalizations with two context symbols. The factors include the type of context symbol, the current probability interval A, and the two probability values Qe0, Qe1 for the LPS symbols. During the coding process, the code variables for the first context can be predicted as the base value for the second context. Then, the shift number of probability interval A is calculated, and three indicated signals RenormeZero, RenormeOne, and RenormeTwo are generated for the different renormalization cases. If the two context symbols are both LPS symbols, then there must be two renormalizations. For that reason, the LPSLPS situation is not shown in Fig. 7. As the prediction continues, the shift numbers of A for two context symbols are registered in the NumSLA0 and NumSLA1 registers, respectively. These two shift numbers are used to predict the types of BYTEOUT. 
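The renormalization cases described above can be sketched with a sequential software model that codes the two decisions one after the other and records how often A has to be renormalized. This is only an illustration: the hardware evaluates the cases in parallel, whereas the sketch loops over the two symbols. The function name predict_pair is hypothetical, and qe1 stands for the Qe value that has already been selected for the second symbol.

```python
def predict_pair(a, qe0, is_mps0, qe1, is_mps1):
    """Return (signal name, NumSLA0, NumSLA1, new A) for two decisions."""
    shifts = []
    for qe, is_mps in ((qe0, is_mps0), (qe1, is_mps1)):
        a -= qe
        if is_mps and a >= 0x8000:
            shifts.append(0)           # MPS and A is still normalized
            continue
        # Conditional exchange: the MPS keeps the larger sub-interval.
        a = max(a, qe) if is_mps else min(a, qe)
        n = 16 - a.bit_length()        # leading zeros of the 16-bit A
        a <<= n
        shifts.append(n)
    names = ("RenormeZero", "RenormeOne", "RenormeTwo")
    return names[sum(n > 0 for n in shifts)], shifts[0], shifts[1], a


print(predict_pair(0x8000, 0x5601, True, 0x5601, True))
print(predict_pair(0xF000, 0x0001, True, 0x0001, False))
```

An LPS always leaves A below 0x8000 in this model, which is why two LPS symbols always produce RenormeTwo.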
The conditions for BYTEOUT are explained in the following subsections.

3.5.1. No BYTEOUT with RenormeOne

If NumSLA - CurrentCT < 0, then no BYTEOUT occurs. The NumSLA can be either NumSLA0 or NumSLA1, depending on which symbol causes the renormalization. The CurrentCT is the number of left shifts of the code register C that remain before the next BYTEOUT. In this condition, the left shifts of A cannot decrease CurrentCT to 0, so there is no BYTEOUT procedure. In this case, the code register is shifted left by NumSLA.

3.5.2. One BYTEOUT with RenormeOne

There are two different conditions for the one BYTEOUT case:
(a) NewCT == 7 and 0 ≤ NumSLA - CurrentCT < 7
(b) NewCT == 8 and 0 ≤ NumSLA - CurrentCT < 8

Fig. 7. Different cases for the renormalization. (a) MPSMPS, (b) MPSLPS and (c) LPSMPS. [Logic diagrams. The inputs DIFF, Qe0, Qe1, NT_NMPS_Qe1, A, LZ0, and SHIFT_NUM feed a Qe1_MUX, shifters, subtractors, and comparators against 0x8000, which generate the RenormeZero, RenormeOne, and RenormeTwo signals.]

Either of these will lead to a single BYTEOUT procedure. The NewCT is determined by the previous byte emitted and the current C value. The logic of NewCT is:

If (previous byte is FFH) or (C ≥ 0x8000000 and previous byte is FEH) then NewCT = 7; else NewCT = 8.

Under this condition, the current byte emitted will be C[27:20] or C[26:19], depending on the previous byte emitted and the current C value. After the BYTEOUT procedure, the C value is shifted left by NumSLA - CT. Each case for the BYTEOUT position is shown shaded in Fig. 8.

3.5.3. Two BYTEOUTs with RenormeOne

If neither the no-BYTEOUT condition nor the one-BYTEOUT conditions above are satisfied, two BYTEOUTs take place. The detailed position of the bytes emitted in this case is shown in Fig. 9.

When there are two renormalizations, the corresponding BYTEOUT conditions are more complicated. First, we must clearly distinguish the two shift numbers of the probability interval A, NumSLA0 and NumSLA1. Then, the current C value, the shift number CT, and the previous byte emitted decide the number of BYTEOUT procedures. For two renormalizations, the number of bytes sent to the code stream can be 0, 1, 2, 3, or 4. Sections 3.5.4–3.5.8 give the details.
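The single-renormalization rules of Sections 3.5.1–3.5.3, and the way two of them chain together when both symbols renormalize, can be sketched as a small counting model. The function names renorm_bytes and pair_bytes are hypothetical. The tuple new_cts holds the CT reload value (7 or 8) that follows each emitted byte and is treated here as a given input, whereas in the encoder it follows from the NewCT logic; the sketch also assumes that a reload value of 7 is never followed by another 7 within one renormalization.

```python
def renorm_bytes(num_sla, current_ct, new_cts=(8, 8)):
    """Count the bytes emitted by one renormalization.

    Returns (bytes emitted, CT after the renormalization).
    """
    emitted = 0
    while num_sla >= current_ct:      # CT reaches zero: one BYTEOUT
        num_sla -= current_ct
        current_ct = new_cts[emitted]
        emitted += 1
    return emitted, current_ct - num_sla


def pair_bytes(num_sla0, num_sla1, current_ct, cts0=(8, 8), cts1=(8, 8)):
    """Chain two renormalizations; returns (bytes0, bytes1, final CT)."""
    b0, ct_mid = renorm_bytes(num_sla0, current_ct, cts0)
    b1, ct_end = renorm_bytes(num_sla1, ct_mid, cts1)
    return b0, b1, ct_end


print(renorm_bytes(3, 12))            # NumSLA - CurrentCT < 0
print(renorm_bytes(10, 5, (7, 8)))    # NewCT == 7 and 0 <= 5 < 7
print(pair_bytes(15, 15, 1))          # both symbols emit two bytes
```

The intermediate CT returned for the first symbol plays the role of the CT value that the conditions for the second symbol are checked against.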

3.5.4. No BYTEOUT with RenormeTwo If (NumSLA0 CurrentCT o0) and (NumSLA1 (CurrentCT  NumSLA0)o0) then there is no byte emitted at all. From the above conditions, we can see that even two shift left numbers of A cannot decrease CurrentCT to 0. Thus, no byte will be sent to the code stream, and the code register C should be shifted left by NumSLA0 +NumSLA1. 3.5.5. One BYTEOUT with RenormeTwo There are two cases for one byte emitted when two renormalizations occur.

ARTICLE IN PRESS K. Liu et al. / INTEGRATION, the VLSI journal 43 (2010) 305–317

CASE 0. Each symbol emits only one byte. The conditions are:

‘0’ previous byte is FFH 31

27

20

0

current C < 0x8000000 26

31

19

0

C is shifted left by NumSLA -CT after a byte outputs Fig. 8. One BYTEOUT.

‘0’

31

313

‘0’

27

A bit is set to 0 if previous byte value is FFH 12 0

20 Byte0

Byte1

(a) ð0 r NumSLA0CurrentCT o 7Þ and ðNewCT0 ¼ ¼ 7Þ and ð0 rNumSLA1ð7ðNumSLA0CurrentCTÞÞ o 7Þ and ðNewCT1 ¼ ¼ 7Þ (b) ð0 r NumSLA0CurrentCT o 7Þ and ðNewCT0 ¼ ¼ 7Þ and ð0 rNumSLA1ð8ðNumSLA0CurrentCTÞÞ o 8Þ and ðNewCT1 ¼ ¼ 8Þ (c) ð0 r NumSLA0CurrentCT o 8Þ and ðNewCT0 ¼ ¼ 8Þ and ð0 rNumSLA1ð7ðNumSLA0CurrentCTÞÞ o 7Þ and ðNewCT1 ¼ ¼ 7Þ (d) ð0 r NumSLA0CurrentCT o 8Þ and ðNewCT0 ¼ ¼ 8Þ and ð0 rNumSLA1ð8ðNumSLA0CurrentCTÞÞ o 8Þ and ðNewCT1 ¼ ¼ 8Þ The above four conditions are OR logical relations. In fact, these are combinations of the two conditions for only one byte output from each symbol. The NewCT1 and NewCT0 can be set the same way described above. CASE 1. Two bytes are emitted by the first symbol, and no byte for the second symbol. The corresponding conditions are:

‘0’ current C < 0x8000000 31

26

19 Byte0

11

0

Byte1

C is shifted left by NumSLA -CT after bytes output Fig. 9. Two ByteOuts.

(a) ðNumSLA0CurrentCT Z 7Þ and ðNewCT0 ¼ ¼ 7Þ and ðNumSLA1CurrentCT00 o 0Þ (b) ðNumSLA0CurrentCT Z 8Þ and ðNewCT0 ¼ ¼ 8Þ and ðNumSLA1CurrentCT00 o 0Þ Either of the two conditions will lead to CASE 1. CurrentCT00 is the shift left number of C after the first symbol updates. In fact, CurrentCT00 ¼ NewCT0ðNumSLA0CurrentCTNewCT0Þ

CASE 0. One byte is emitted by the first symbol, no byte by the second symbol. The corresponding conditions are: (a) ððNewCT ¼ ¼ 7Þ and ð0 r NumSLA0CurrentCT o 7ÞÞ or ððNewCT ¼ ¼ 8Þ and ð0 r NumSLA0CurrentCT o 8ÞÞ (b) ðNumSLA1ð7ðNumSLA0CurrentCTÞÞ o 0Þ or ðNumSLA1ð8ðNumSLA0CurrentCTÞÞ o 0Þ Both conditions must be satisfied for this case. The first condition shows only one byte output for the first symbol, and the second one shows no byte output for the second symbol. CASE 1. No byte is emitted by the first symbol, one byte by the second symbol. The corresponding conditions are: (a) ðNewCT ¼ ¼ 7Þ and ð0 rNumSLA1ðCurrentCTNumSLA0Þ o 7Þ and ðNumSLA0CurrentCT o 0Þ (b) ðNewCT ¼ ¼ 8Þ and ð0 rNumSLA1ðCurrentCTNumSLA0Þ o 8Þ and ðNumSLA0CurrentCT o 0Þ If either of the two conditions is satisfied, this case can occur. The two conditions describe two situations in which there will be only one byte output from the second symbol when no byte is emitted for first symbol. The calculation of NewCT is the same as in the condition for one byte output when only one renormalization occurs. 3.5.6. Two BYTEOUTs with RenormeTwo For two BYTEOUTs, there are three different cases. The corresponding results are shown in Fig. 9.

CASE 2. Two bytes are emitted by the second symbol, and no byte by the first symbol. The corresponding condition is:
(NumSLA0 − CurrentCT < 0) and (((NumSLA1 − CurrentCT'' ≥ 7) and (NewCT1 == 7)) or ((NumSLA1 − CurrentCT'' ≥ 8) and (NewCT1 == 8)))

In this condition, CurrentCT'' = CurrentCT − NumSLA0.

3.5.7. Three BYTEOUTs with RenormeTwo
Two cases for three emitted bytes are shown in Fig. 10.

CASE 0. Two bytes are output by the first symbol, one byte by the second symbol. The two OR conditions are:
(a) (NumSLA0 − CurrentCT ≥ 7) and (NewCT0 == 7) and ((0 ≤ NumSLA1 − CurrentCT'' < 7) and (NewCT1 == 7) or (0 ≤ NumSLA1 − CurrentCT'' < 8) and (NewCT1 == 8))
(b) (NumSLA0 − CurrentCT ≥ 8) and (NewCT0 == 8) and ((0 ≤ NumSLA1 − CurrentCT'' < 7) and (NewCT1 == 7) or (0 ≤ NumSLA1 − CurrentCT'' < 8) and (NewCT1 == 8))

These conditions are a combination of the two-byte output and one-byte output conditions.

CASE 1. Two bytes are output by the second symbol, one byte by the first symbol. The two OR conditions are:
(a) (0 ≤ NumSLA0 − CurrentCT < 7) and (NewCT0 == 7) and ((NumSLA1 − CurrentCT'' ≥ 7) and (NewCT1 == 7) or (NumSLA1 − CurrentCT'' ≥ 8) and (NewCT1 == 8))
(b) (0 ≤ NumSLA0 − CurrentCT < 8) and (NewCT0 == 8) and ((NumSLA1 − CurrentCT'' ≥ 7) and (NewCT1 == 7) or (NumSLA1 − CurrentCT'' ≥ 8) and (NewCT1 == 8))
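The case analysis for RenormeTwo can be cross-checked against a plain sequential model in which C is shifted one bit at a time and a byte is emitted whenever CT reaches zero. The Python sketch below is a software reference model only, not the parallel hardware described here. The names `renorm_shifts` and `renorm_two` are hypothetical, and one reload value per symbol (NewCT0, NewCT1) is assumed.

```python
def renorm_shifts(num_sla, ct, new_ct):
    """Apply num_sla single-bit shifts; return (bytes_emitted, ct_after)."""
    emitted = 0
    for _ in range(num_sla):
        ct -= 1
        if ct == 0:
            # A full byte has been shifted out: BYTEOUT, then reload CT.
            emitted += 1
            ct = new_ct  # assumed 7 after bit stuffing, otherwise 8
    return emitted, ct


def renorm_two(num_sla0, num_sla1, current_ct, new_ct0, new_ct1):
    """Process two symbols back to back; return (bytes0, bytes1, ct_after)."""
    bytes0, ct = renorm_shifts(num_sla0, current_ct, new_ct0)
    bytes1, ct = renorm_shifts(num_sla1, ct, new_ct1)
    return bytes0, bytes1, ct
```

With CurrentCT = 3, `renorm_two(5, 4, 3, 8, 8)` gives one byte for the first symbol and none for the second, while `renorm_two(12, 10, 3, 7, 8)` gives two bytes followed by one byte, which are the outcomes predicted by the corresponding inequalities.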


Fig. 10. (a) Three BYTEOUTs and (b) three BYTEOUTs.

3.5.8. Four BYTEOUTs with RenormeTwo
The last situation is two bytes output for each symbol. The output position is illustrated in Fig. 11, and the corresponding conditions are:
(a) (NumSLA0 − CurrentCT ≥ 7) and (NewCT0 == 7) and ((NumSLA1 − CurrentCT'' ≥ 7) and (NewCT1 == 7) or (NumSLA1 − CurrentCT'' ≥ 8) and (NewCT1 == 8))
(b) (NumSLA0 − CurrentCT ≥ 8) and (NewCT0 == 8) and ((NumSLA1 − CurrentCT'' ≥ 7) and (NewCT1 == 7) or (NumSLA1 − CurrentCT'' ≥ 8) and (NewCT1 == 8))

Thus far, we have described the different situations for the BYTEOUT procedure when two context symbols are processed in parallel. In practice, a FIFO can be used to buffer the output code stream for speed balancing. In the next section, we explain the overall timing of our arithmetic coder.

3.6. Timing
The timing of the proposed MQ encoder is shown in Fig. 12. The SHIFT module computes the shift number related to the two LPS probabilities, Qe0 and Qe1. The Delay module stands for a delay of one clock cycle. The A_CAL and C_CAL modules are used for the interval A and the C register calculations, respectively. At the same time, the NumSLA0 and NumSLA1 modules are responsible for the shift number of A. As shown in Fig. 12, the latency of the pipeline is only five clock cycles, and there is no feedback path to stall the pipeline. We can pipeline the execution of our MQ encoder by simply starting two new CXD pairs on each clock cycle.
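The effect of a five-cycle latency with one new group of two CXD pairs issued per clock cycle can be illustrated with a short calculation. This Python sketch assumes an ideal pipeline with no stalls, as stated in the text. The function names are hypothetical and the internal stage assignment is not modeled.

```python
def pipeline_cycles(n_groups, latency=5):
    """Cycles needed to finish n_groups when one group issues per cycle."""
    if n_groups == 0:
        return 0
    # The first group needs `latency` cycles; each later group adds one.
    return n_groups + latency - 1


def symbols_per_cycle(n_groups, latency=5, symbols_per_group=2):
    """Average number of CXD pairs completed per clock cycle."""
    return symbols_per_group * n_groups / pipeline_cycles(n_groups, latency)
```

Under these assumptions three groups finish in seven cycles, consistent with the CC1 to CC7 span drawn in Fig. 12, and for long runs the average approaches two CXD pairs per cycle (about 1.99 for 1000 groups).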

Fig. 11. Four BYTEOUTs.


Fig. 12. The timing order of the pipeline. (Diagram: over clock cycles CC1 to CC7, the CXD pair groups (CX0,D0)/(CX1,D1), (CX2,D2)/(CX3,D3) and (CX4,D4)/(CX5,D5) enter the pipeline on successive cycles and pass through the SHIFT, NumSLA0/NumSLA1, Delay, A_CAL, C_CAL and ByteOut modules.)

Although each group of two CXD pairs takes five clock cycles to complete, during each clock cycle the encoder initiates two new CXD pairs and executes some part of five different groups. At the output port, the compressed code stream is checked on every clock cycle to see whether the corresponding CXD pairs have generated code bytes. One important characteristic of the encoder pipeline is that there are no data hazards that would make it necessary to stall the pipeline. The encoder pipeline is independent of the context label, which allows the encoder to deal consistently with two CXD pairs per clock cycle.

4. Experimental results
We implemented our dual-symbol MQ encoder in VHDL, using Xilinx and Altera FPGAs as target devices. Table 5 shows the final implementation results on the two FPGA platforms as reported by their default synthesis tools, i.e., XST for Xilinx and the Quartus layout tool for Altera. Table 6 shows the detailed critical path information on the Xilinx FPGA platform. The memory required by the FPGA system consists of two memory blocks: PETROM, with 32 × 43 bits, and INDEXRAM, with 19 × 7 bits. Thus, there are 1509 bits in our dual-symbol MQ encoder.

Table 5 The synthesis results of different FPGA devices.

| FPGA device | Slices | LCs | Memory (bits) | Clk (MHz) | Throughput (MSPS) |
|---|---|---|---|---|---|
| XC4VLX80 | 6974 | – | 1509 | 48.30 | 96.60 |
| STRATIX | – | 12649 | 1509 | 40.53 | 81.06 |

Table 6 Critical path results for each part using XC4VLX80 (in nanoseconds).

| Part | Delay | A_CAL | C_CAL | BYTEOUT |
|---|---|---|---|---|
| Critical path (ns) | 5.69 | 20.01 | 20.70 | 15.88 |

For comparison, we also list the memory bits and throughput results for [8] in Table 7. (Note: MEM stands for memory bits; TP stands for throughput, in MSPS. All results are reported on Altera's Stratix platform.)

Table 7 Storage and throughput results using FPGA device.

| Architecture | MEM (bits) | TP (MSPS) |
|---|---|---|
| 2CXD hypothesis | 3302 | 66.38 |
| 2CXD hypothesis with queue | 3302 | 73.25 |
| 3CXD hypothesis | 3929 | 60.78 |
| 3CXD hypothesis with queue | 3929 | 71.48 |
| Brute force | 8192 | 73.80 |
| Brute force with modified Byte out | 8192 | 97.70 |
| Proposed method | 1509 | 81.06 |

From Table 7, we can see that our encoder's throughput is higher than those of Dyer et al.'s [8] architectures, except for the brute force with modified byte out method. However, our memory size is the smallest of all the architectures. In the FPGA devices, the memory is rounded up to a power of two, i.e., 2048 bits. This memory capacity is just one quarter of that required by Dyer et al.'s [8] fastest method; that is, 2048/8192 = 0.25. Considering both memory size and throughput, we believe that our architecture is slightly superior to Dyer et al.'s [8] best method, i.e., the brute force with modified byte out method. Because the other architectures reported by Gupta et al. [4], Ong et al. [5], Chiang et al. [6], and Zhang et al. [7] cannot reach two CXD pairs per clock cycle consistently, we do not list their results here.

For a careful comparison, we also synthesized our proposed architecture with the DC tools, using the TSMC 0.18 µm technology library. Table 8 lists the results of the different methods for ASIC implementation. As shown in Table 8, the proposed architecture achieves the highest throughput (440.20 MSPS) with a smaller gate area than either brute force architecture.

Table 8 TSMC 0.18-µm synthesis results.

| Architecture | Gate area (µm²) | Clk (MHz) | TP (MSPS) |
|---|---|---|---|
| 2CXD hypothesis | 196171.12 | 249.38 | 306.74 |
| 2CXD hypothesis with queue | 199234.73 | 248.14 | 334.99 |
| 3CXD hypothesis | 248033.02 | 201.61 | 268.14 |
| 3CXD hypothesis with queue | 268699.94 | 205.34 | 301.85 |
| Brute force | 384817.91 | 194.17 | 388.34 |
| Brute force with modified BYTE OUT | 381345.14 | 211.86 | 423.72 |
| Proposed method | 321415.03 | 220.10 | 440.20 |

5. Conclusions

In this paper, we describe a high performance architecture for an MQ arithmetic coder that processes two contexts in parallel, based on our analysis of the MQ algorithm in JPEG2000. The encoder uses multiple PEs to predict the probability interval A, then computes the code string register C and merges the BYTEOUT process with the bit-stuffing process. The architecture can deal with two context and decision pairs simultaneously, regardless of the context labels. According to the synthesis results, our architecture can reach 96.60 MSPS at its maximum speed, which is close to the speed of Dyer et al.'s [8] best architecture, while occupying only 25% of the memory bits required by that counterpart. Therefore, the proposed architecture is a good candidate for use as a high speed real-time JPEG2000 encoder in various applications.
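The memory and throughput figures quoted in this paper can be re-derived with simple arithmetic. The Python sketch below uses only numbers reported in the text. The variable and function names are hypothetical, and the throughput relation (two CXD pairs per clock cycle) is the assumption that the dual-symbol design implies.

```python
# Memory blocks reported for the FPGA implementation.
PETROM_BITS = 32 * 43      # probability estimation table ROM
INDEXRAM_BITS = 19 * 7     # context index RAM
total_bits = PETROM_BITS + INDEXRAM_BITS


def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p


allocated_bits = next_power_of_two(total_bits)
memory_ratio = allocated_bits / 8192   # versus the 8192-bit brute force designs


def throughput_msps(clk_mhz, symbols_per_cycle=2):
    """Throughput in MSPS, assuming a fixed number of CXD pairs per cycle."""
    return round(clk_mhz * symbols_per_cycle, 2)
```

Here `total_bits` evaluates to 1509, `allocated_bits` to 2048 and `memory_ratio` to 0.25, while `throughput_msps(48.30)` and `throughput_msps(40.53)` give the 96.60 and 81.06 MSPS reported for the two FPGA devices.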

Acknowledgments
The authors would like to thank the reviewers for their helpful comments and revisions. This material is based upon work supported by the National Natural Science Foundation of China under Grant Nos. 60802076, 60633020, and 60872041, the Fundamental Research Funds for the Central Universities under Grant No. JY10000903003, and the 111 Project under Grant B08038 and PCSITR.

References
[1] ITU T.800, JPEG2000 Image Coding System Part 1, ITU Standard, July 2002.
[2] D. Taubman, E. Ordentlich, M. Weinberger, G. Seroussi, I. Ueno, F. Ono, Embedded block coding in JPEG2000, in: Proceedings of the International Conference on Image Processing (ICIP’02), vol. 2, 2000, pp. II33–II36.
[3] C.-J. Lian, K.-F. Chen, H.-H. Chen, L.-G. Chen, Analysis and architecture design of block-coding engine for EBCOT in JPEG2000, IEEE Trans. Circuits Systems Video Technol. 13 (3) (2003) 219–230.
[4] A.K. Gupta, D. Taubman, S. Nooshabadi, High speed VLSI architecture for bit plane encoder of JPEG2000, in: Proceedings of the Midwest Symposium on Circuits Systems (MWSCAS’04), vol. 2, 2004, pp. II233–II236.
[5] K.-K. Ong, W.-H. Chang, Y.-C. Tsenf, Y.-S. Lee, C.-Y. Lee, A high throughput low cost context-based adaptive arithmetic codec for multiple standards, in: Proceedings of the International Conference on Image Processing (ICIP’02), vol. 1, 2002, pp. I872–I875.
[6] J.-S. Chiang, C.-H. Chang, Y.-S. Lin, C.-Y. Hsieh, C.-H. Hsia, High-speed EBCOT with dual context-modeling coding architecture for JPEG2000, Proc. IEEE Int. Symp. Circuits Systems 3 (2004) 865–868.
[7] Y.-Z. Zhang, C. Xu, L.-B. Chen, A dual-symbol coding arithmetic coder architecture design for high speed EBCOT coding engine in JPEG2000, in: Proceedings of the International Conference on Image Processing (ICIP’05), vol. 1, 2005, pp. I322–I325.
[8] M. Dyer, D. Taubman, S. Nooshabadi, Concurrency techniques for arithmetic coding in JPEG2000, IEEE Trans. Circuits Systems I: Regular Papers 53 (6) (2006) 1203–1213.
[9] W.B. Pennebaker, J.L. Mitchell, G.G.L. Jr, R.B. Arps, An overview of the basic principles of the q-coder adaptive binary arithmetic coder, IBM J. Res. Dev. 32 (6) (1988) 717–726.


[10] D.S. Taubman, M.W. Marcellin, JPEG2000 Image Compression Fundamentals, Standards and Practice, Kluwer, Norwell, MA, 2002, pp. 473–484 (Chapter 12).


Yun Song Li is a professor of communication at the Xidian University, China. He received his Ph.D. in signal processing from the Xidian University in 2002. His major research interests include spectral image coding and image analysis.

Kai Liu received the B.S. and M.S. degrees in computer science from Xidian University, Xi’an, China, in 1999 and 2002, respectively. In 2005, he received his Ph.D. in signal processing from the Xidian University. Now he is an associate professor of computer science and technology at the Xidian University, China. His major research interests include VLSI architecture design and image coding.

Jian Feng Ma was a professor of computer school at the Xidian University, China. He is now dean of computer school. His major research interests include computer security and information theory.

Yu Zhou received his M.S. degree in signal processing from the Xidian University in 2009. He is now an employee of ZTE company in ShenZhen, China.