J. Parallel Distrib. Comput. 73 (2013) 411–419
Parallel design for error-resilient entropy coding algorithm on GPU✩

Yuan Dai a,b, Yong Fang b, Dongjian He a,∗, Bormin Huang c

a College of Mechanical and Electronic Engineering, Northwest A&F University, China
b Lab. Agri- and Bio-Information Processing, College of Information Engineering, Northwest A&F University, China
c Space Science and Engineering Center, University of Wisconsin-Madison, USA
Article history:
Received 6 May 2012
Received in revised form 9 September 2012
Accepted 12 December 2012
Available online 23 December 2012

Keywords: P-EREC; EREC; Parallel processing; GPU; CUDA
Abstract

The error-resilient entropy coding (EREC) algorithm is an effective method for combating error propagation at low cost in many compression methods that use variable-length coding (VLC). However, the main drawback of the EREC is its high complexity. In order to overcome this disadvantage, a parallel EREC is implemented on a graphics processing unit (GPU) using NVIDIA CUDA technology. The original EREC exposes only fine-grained parallelism at each stage, which brings additional communication overhead when parallelized directly. To achieve a highly efficient parallel EREC, we propose the partitioning EREC (P-EREC) algorithm, which splits the variable-length blocks into groups, and each group is then coded using the EREC separately. Each GPU thread processes one group, which makes the EREC coarse-grained parallel. In addition, some optimization strategies are discussed in order to obtain higher performance on the GPU. In the case that the variable-length data blocks are divided into 128 groups (256 groups, resp.), experimental results show that the parallel P-EREC achieves a 32× to 123× (54× to 350×, resp.) speedup over the original C code of the EREC compiled with the O2 optimization option. Even higher speedup can be obtained with more groups. Compared to the EREC, the P-EREC not only achieves a good speedup but also slightly improves the resilience of the VLC bit-stream against burst or random errors.

© 2013 Elsevier Inc. All rights reserved.
1. Introduction

There has long been an increasing demand for error-resilient image coding techniques, as image data are usually transmitted through band-limited, high bit-error-rate (BER) fading communication channels, e.g. wireless and mobile channels [6]. Most current image compression algorithms work by splitting the image into blocks and then generating variable-length codes for each block. However, variable-length codes are known to be highly susceptible to channel errors. Hence, many compression standards insert some redundant information, e.g. resynchronization markers, into the bit-streams in order to restart the decoder from a known position in the event of transmission errors [12]. The error-resilient entropy coding (EREC) is a robust method which allows the decoder to regain synchronization at the beginning of each block with little redundant information [18].
✩ This work was supported by the National Science Foundation of China (Grant Nos. 60975007, 61271280, 61001100) and the Provincial Science Foundation of Shaanxi, China (Grant No. 2010K06-15).
∗ Corresponding author.
E-mail addresses: [email protected] (Y. Dai), [email protected] (D. He).
doi:10.1016/j.jpdc.2012.12.008
It works by reorganizing variable-length data blocks (VLBs) into fixed-length slots (FLSs) such that each block starts at a known position, and hence the decoder is automatically synchronized at the start of each block. Therefore, the EREC effectively protects the bits at the beginning of each VLB and reduces the channel error propagation effects, even though error propagation remains notable at the higher frequencies, which are less visible in images [18]. The EREC has been applied to both still image and video compression schemes. For example, in [11,22,25] the EREC is adopted to improve the error resilience of images and videos to channel bit errors while maintaining high compression efficiency. The experimental results there show that the algorithm is economical and robust.

However, the EREC suffers from high computational complexity, and it takes much more computational time as the number of blocks increases, which makes it difficult to satisfy the requirements of real-time processing and restricts its widespread application. A common solution to this problem is to optimize the algorithm. R. Chandramouli et al. propose a fast error-resilient entropy coding (FEREC) algorithm that improves the search strategy; they make full use of the correlation between the lengths of adjacent blocks to enhance the efficiency of bit placement [6]. The FEREC is almost twice as fast as the EREC and its error resilience is also observed to be better. In [12], an improvement to the FEREC is introduced which uses the correlation between the code lengths of successive blocks. It first calculates the initial search position according to the bit lengths of consecutive blocks.
Then the initial offset value is determined by the statistical distribution of long and short blocks. Finally, the value is adjusted to ensure that all offset sequence values can be used. This algorithm runs almost twice as fast as the FEREC by reducing the number of iterations. In addition, N.T. Cheng et al. also discuss the search strategy and find that a pseudo-random offset sequence is better than a deterministic sequence [7]. However, all of these works focus on the search strategy of the EREC, which limits how much the performance can be improved. Hence, speeding up the EREC remains a challenging research problem owing to its high computational cost.

In this paper, we attempt to improve the performance of the EREC by parallel processing. Parallel processing is a good candidate for increasing computational speed, and it can be exploited in a cost-effective manner [1]. In practice, we can resort to parallel computing with the graphics processing unit (GPU) to accelerate the computation of the EREC. The GPU has become increasingly competitive in terms of performance, programmability, price, power consumption, and scale in recent years [17], and it has also become increasingly suitable for general-purpose computation. Our aim is to boost the performance of the EREC on the GPU using CUDA technology, which in turn gives us more freedom in exploiting the parallelism. We first implement a parallel algorithm based on the original EREC on the GPU, which yields a poor result. Then, in order to obtain better performance, we introduce the partitioning EREC (P-EREC) algorithm. It works by splitting the VLBs into multiple groups; in each group, the VLBs are placed into FLSs, and the groups can be coded independently using the EREC simultaneously. Each GPU thread processes one group, which realizes a coarse-grained parallelization and shows good performance.

This paper is arranged as follows. Section 2 provides a brief review of the EREC, GPU computing, the CUDA architecture and the specification of the NVIDIA GTX480. Section 3 introduces the implementation of the EREC on the CPU. Section 4 deals with the GPU implementation of the EREC and gives the experimental results. Section 5 focuses on the GPU implementation of the P-EREC, discusses some useful optimization methods on the GPU with CUDA, demonstrates the experimental results and analyzes the performance of the P-EREC. The conclusions are presented in Section 6.
2. Related works

2.1. Review on EREC

The EREC algorithm provides an error-resilient scheme for coding the data produced by many block-based data compression algorithms. It is applicable to block coding strategies where the input signal is split into blocks that are coded as VLBs. Note that each block must be a prefix code such that, in the absence of channel errors, the decoder knows when it reaches the end of each block (EOB). Most entropy-based source coding schemes satisfy this demand, e.g. Huffman coding or arithmetic coding [18].

The EREC reorganizes N VLBs into N FLSs. Let bi be the length of the ith block, Si be the length of the ith slot and L be the total length of the blocks. The encoder needs to choose a total bit length T which is sufficient to code all the data. Their relation is given by formula (1) [12]. A suitable value of T must be selected in order to minimize the redundancy:

T = Σ_{i=1}^{N} Si ≥ Σ_{i=1}^{N} bi.    (1)

If T equals L, there is no redundancy in implementing the EREC coding and decoding. Si is then determined by formula (2) [11]:

Si = ⌈b⌉ for 0 ≤ i < L%N,    Si = ⌊b⌋ for L%N ≤ i < N,    (2)

where ‘‘%’’ is the mod operation and b = L/N.

The EREC encoding process includes N stages. The first stage involves allocating each block of data to a corresponding code slot; starting from the beginning of each block, all or as many bits as possible are placed within the corresponding slot. Then at stage n, each VLBi with unplaced data searches FLSj, with j = (i + φn) mod N, for blank bits in which to place as many of its remaining bits as possible, where φ is a predefined search offset sequence. Fig. 1 shows an example of the operation of the EREC algorithm placing six blocks of lengths 10, 5, 4, 7, 2, and 8 into six slots of size 6. At stage 1, as many bits as possible are placed into the corresponding slots; after stage 1 (Fig. 1(a)), slots 1, 4, and 6 are full, while slots 2, 3, and 5 have space left. At stage 2, one bit from block 1 is placed into slot 2 while the remaining bit from block 4 is placed into slot 5 (Fig. 1(b)). At stage 3, as shown in Fig. 1(c), two bits from block 1 are placed in slot 3. No bit is placed at stage 4, and block 1 then searches slot 5 and places its last remaining bit there at the end of stage 5. Fig. 1(d) shows that by the end of stage 6, all the data has been placed. In the case of error-free transmission, the decoder can follow the same algorithm to recover the VLBs as it goes along until it finds the end of each block; it only needs to know the total number of transmitted bits and the number of VLBs [18].

2.2. GPU computing

The computing system consists of a host, which is a traditional CPU, while the GPU can be regarded as a compute device or coprocessor. The GPU is a single-instruction multiple-data (SIMD) parallel device. Therefore, we can offload the data-parallel, computationally intensive portions of an algorithm or application to the GPU [10]. Recently, there has been increasing interest in performing general-purpose computations on GPUs since the introduction of the CUDA technique. CUDA gives us more freedom in utilizing the parallelism and has moved general-purpose computing on the GPU from theory toward practical application. Thus the GPU has already been applied to various fields, including image processing [9], computational geometry [5,14], video coding [8], database operations [3], and route planning [23].

2.3. CUDA architecture

The CUDA architecture, developed by NVIDIA, is a well-known parallel programming model for general-purpose computing on graphics processing units (GPGPU) and for writing highly parallel applications [21]. It is a hardware and software architecture that issues and manages data-parallel computations on a GPU [10]. CUDA provides fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. These guide us to partition the problem into coarse sub-problems that can be executed independently, and then into finer pieces that can be solved cooperatively in parallel [24]. The CUDA programming model is an extension of the C language and maintains a low learning curve for programmers [24]. A CUDA computing system consists of a host and one or more devices such as GPUs, which are massively parallel processors equipped with a large number of arithmetic execution units. The device code is executed in parallel and written in C extended with keywords for labeling data-parallel functions, called kernels. A kernel is executed in parallel across a set of parallel threads in a Single Instruction Multiple Thread (SIMT) model [4]. Threads are organized as a grid of thread-blocks, and each thread executes the same kernel code. In addition, the host and device codes are executed in two different memory spaces, so the host code must use special calls, e.g. host-to-device and device-to-host copies, for data transfers [2].
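As a minimal illustration of this host/device pattern (a generic sketch with made-up names, not code from the paper), a CUDA program copies data to the device, launches a kernel over a grid of thread-blocks, and copies the result back:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per element */
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);    /* host-to-device transfer */

    int threads = 256;
    int blocks = (n + threads - 1) / threads;                     /* grid of thread-blocks */
    scaleKernel<<<blocks, threads>>>(d_data, 2.0f, n);            /* SIMT kernel launch */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);    /* device-to-host transfer */
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}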
Table 1
Run time of EREC on CPU.

Block number    Total bits     Run time (ms)
4 096             509 744         125
8 192           1 024 506         484
16 384          2 048 810       1 636
32 768          4 085 383       6 209
57 344          7 133 856      15 450
65 536          8 149 850      19 994
73 728          9 170 797      25 401
81 920         10 184 121      31 307
Fig. 1. Example of the six-stage EREC (search offset sequence φn: {0, 1, 2, 3, 4, 5}).
Algorithm 1: Reorganizing N VLBs into N FLSs
Inputs: blks[0···N-1], ibptr[0···N], isptr[0···N], phi[0···N-1];
Output: slts[0···N-1];
1.  bptr[0···N] ← ibptr[0···N]; sptr[0···N] ← isptr[0···N];
2.  unplaced ← 0; unfilled ← 0; nbits ← 0;
3.  for s ← 0 to N-1
4.    for i ← 0 to N-1
5.      j ← (i + phi[s]) % N;
6.      unplaced ← ibptr[i+1] - bptr[i];
7.      unfilled ← isptr[j+1] - sptr[j];
8.      nbits ← Min(unplaced, unfilled);
9.      if nbits > 0 then
10.       movbits(slts, blks, sptr[j], bptr[i], nbits);
11.       bptr[i] ← bptr[i] + nbits;
12.       sptr[j] ← sptr[j] + nbits;
13.     end if
14.   end for
15. end for

2.4. Specification of the NVIDIA GTX480

In our work, we use an NVIDIA GTX480 GPU with CUDA compute capability 2.0. The GTX480 has 15 Streaming Multiprocessors (SMs), each with 32 Streaming Processors (SPs), for a total of 480 processor cores. It has 32 kB of registers and 48 kB of shared memory per SM, both of which provide high-speed data access. In addition, there are two kinds of read-only memory, i.e., 64 kB of constant memory and the texture cache.

3. Implementation of EREC on CPU

The EREC encoding process is described by Algorithm 1. Detailed explanations are as follows.
(1) Inputs and output. The inputs are blks, ibptr, isptr and phi. blks holds the data of the VLBs. ibptr (isptr, resp.) records the starting position of each VLB (FLS, resp.). phi is the search offset sequence. The output slts holds the data of the FLSs. blks and slts are arrays that store the binary code bit by bit, so searching and placing bits can take advantage of bit operations to boost the speed, besides saving storage space.
(2) Variables. unplaced records the number of unplaced bits of block i, and unfilled indicates how many blank bits currently remain in slot j. nbits is the number of bits to be moved from block i to slot j at the present stage; it is the smaller of unplaced and unfilled. s represents the iteration stage.
(3) Position update. bptr[i] (sptr[j], resp.) denotes the starting location of the unplaced bits (unfilled bits, resp.) in block i (slot j, resp.) before the current stage. They need to be updated at the end of each stage.
Table 2
The performance of the original EREC on GPU.

Block number    GPU processing time (ms)    Speedup
32 768          2901                        2.1
57 344          4938                        3.1
65 535          5930                        3.3
73 728          6722                        3.7
81 920          7685                        4.0
(4) Bits search and place. If nbits > 0, there are nbits unplaced bits of block i that can be moved into slot j. This is carried out by the function movbits at line 10 of Algorithm 1. After placing the bits, bptr[i] and sptr[j] must be updated for the next stage. In addition, searching and placing the binary codes use bit operations in order to speed up execution.

We ran experiments on an Intel Core i7 CPU at 3.07 GHz. Table 1 shows the run time of the EREC on the CPU compiled with the O2 optimization option.

4. Implementation of EREC on GPU

It is found that, at each stage, the search by each VLBi with remaining bits for an FLSj with blank bits can be executed in parallel. The N stages themselves cannot be parallelized, because the current stage needs to use the position data bptr[i] (sptr[j], resp.) updated at the previous stage; there is an iterative dependence between two successive stages. Fig. 2 shows the model of the original EREC executed in parallel on the GPU. At every stage, each thread is assigned to one VLB to accomplish its task, including searching for bits, placing bits and recording the positions. cudaThreadSynchronize() is used to synchronize all the threads after the current stage, and then the threads proceed with the next stage. The parallel part of the EREC is executed as a kernel function across a set of threads in the SIMT model. Besides, texture memory, shared memory and other techniques are used to optimize the code on the GPU.

Table 2 shows the performance for part of the data in Table 1 with the original EREC on the GPU. Compared to the CPU code, it gains only a slightly better performance, because the parallel EREC performs inefficiently on the GPU. There are three reasons for such a poor result. Firstly, the EREC only contains fine-grained parallelism at each stage, which accounts for a small fraction of the algorithm. Secondly, launching the kernel function repeatedly takes a lot of time. Thirdly, this pattern brings additional communication overhead, caused by transmitting a new search offset value to update the old one at each stage. As a result, the performance of the original EREC parallelization does not satisfy the requirements of practical applications, nor does it fully utilize the enormous power of the GPU.
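As an illustration of this stage-wise parallelization, the following CUDA sketch assigns one thread to each VLB and relaunches the kernel for every stage, passing the new offset value as an argument each time. The kernel and variable names, the bit-packed buffer layout, and the use of atomicOr() to avoid word-level write conflicts are our assumptions for this sketch, not the authors' code.

__global__ void erecStageKernel(unsigned int *slts, const unsigned int *blks,
                                const int *ibptr, const int *isptr,
                                int *bptr, int *sptr, int phi_s, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;       /* one thread per VLB */
    if (i >= N) return;
    int j = (i + phi_s) % N;                             /* slot searched at this stage */
    int unplaced = ibptr[i + 1] - bptr[i];               /* bits of block i still unplaced */
    int unfilled = isptr[j + 1] - sptr[j];               /* blank bits left in slot j */
    int nbits = min(unplaced, unfilled);
    for (int k = 0; k < nbits; ++k) {                    /* bit-by-bit copy, block i -> slot j */
        int src = bptr[i] + k, dst = sptr[j] + k;
        if ((blks[src >> 5] >> (src & 31)) & 1u)
            atomicOr(&slts[dst >> 5], 1u << (dst & 31)); /* atomic: two slots may share a word */
    }
    bptr[i] += nbits;                                    /* i -> j is one-to-one at a stage,  */
    sptr[j] += nbits;                                    /* so these updates do not race      */
}

/* Host side: one kernel launch per stage, synchronized between stages. */
void erecOnGpu(unsigned int *d_slts, const unsigned int *d_blks,
               const int *d_ibptr, const int *d_isptr,
               int *d_bptr, int *d_sptr, const int *phi, int N)
{
    int threads = 256, blocks = (N + threads - 1) / threads;
    for (int s = 0; s < N; ++s) {
        erecStageKernel<<<blocks, threads>>>(d_slts, d_blks, d_ibptr, d_isptr,
                                             d_bptr, d_sptr, phi[s], N);
        cudaDeviceSynchronize();  /* the paper uses cudaThreadSynchronize(), its older name */
    }
}

The repeated launch and synchronization in the host loop is exactly the per-stage overhead discussed above.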
Fig. 2. The model of the original EREC executed in parallel.

Fig. 3. Model of the P-EREC executed on GPU.

Table 3
P-EREC performance on GPU.

Block number    GPU processing time (ms)    Speedup
32 768           205                         30
57 344           473                         32
65 535           578                         34
73 728           706                         35
81 920           835                         37

5. Parallel P-EREC on GPU

For the sake of a higher speedup of the EREC, we introduce the partitioning EREC (P-EREC) algorithm, which divides the N VLBs into n groups. For each group, the VLBs are placed into FLSs using the EREC separately. Because the P-EREC reduces the computational complexity, it has better performance than the EREC even on the CPU. On this basis, we further optimize the algorithm on the GPU to boost the performance.

5.1. Implementation of P-EREC on GPU

The principle of the P-EREC is that every group is coded independently using the original EREC algorithm. Each GPU thread is assigned to one VLB group, which realizes the coarse-grained parallelism that can be executed effectively on the GPU. Moreover, there is no additional communication overhead between the GPU and the CPU, because it is unnecessary to reload the kernel function and update the offset data at each stage. Fig. 3 demonstrates the model of the P-EREC executed on the GPU.

The data of the VLBs and FLSs are stored in the same way as in the optimized CPU code. We also define one array blks (slts, resp.) for the input data (output data, resp.) rather than n arrays, because the capacity of an array is greater than or equal to the number of bits of the stored data: if we defined n arrays, there might be blank bits in each array, and these would become redundant information. Therefore, the blocks (slots, resp.) are split into groups logically, but the data are stored together. This ensures that blank bits can appear only in the last few bytes of blks (slts, resp.).

Let n_split = N/n, and suppose n_split is an integer. That is, the N blocks are split into n groups and there are n_split blocks in each group. The total number of bits of every group must be known in order to distribute the slot lengths S in the corresponding group using formula (2). We need to launch n threads, each of which encodes one group of data with n_split blocks. This can be seen in the framework of the parallel P-EREC in Fig. 3: all blocks from the first block to the n_splitth block search the slots in the first group and place all their bits, which is accomplished by thread 0; blocks n_split to 2·n_split−1 place all their bits into the slots of the second group, a task assigned to thread 1; and the rest proceeds in the same manner.

Instruction optimization is more critical on the GPU than on the CPU, especially on a GPU with CUDA compute capability 2.0. At each stage, the number of unplaced bits (unfilled bits, resp.) in each block (slot, resp.) must be computed, as well as how many bits can be placed into a slot. These bits are then moved from the block to the slot, which can be processed using bit operations; bit operations replace expensive operations such as division, mod, multiplication, and so forth.

The performance of the parallel P-EREC on the GPU is shown in Table 3 (part of the data of Table 1). The data of the N VLBs in the table are all divided into 128 groups, and 32 threads per block are allocated in the grid. The parallel P-EREC achieves good performance compared with the sequential CPU code of the EREC.
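The following CUDA sketch illustrates the coarse-grained structure described above: each thread runs the complete EREC of Algorithm 1 over its own group of n_split blocks and slots, and the kernel is launched only once. The function and variable names, and the inlined bit copy, are our illustrative assumptions rather than the authors' code.

__global__ void perecGroupKernel(unsigned int *slts, const unsigned int *blks,
                                 const int *ibptr, const int *isptr,
                                 int *bptr, int *sptr, const int *phi,
                                 int n_split, int n_groups)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;        /* one thread per VLB group */
    if (g >= n_groups) return;
    int base = g * n_split;                               /* first block/slot of this group */
    for (int s = 0; s < n_split; ++s) {                   /* all EREC stages, local to the group */
        for (int i = base; i < base + n_split; ++i) {
            int j = base + (i - base + phi[s]) % n_split; /* the search never leaves the group */
            int unplaced = ibptr[i + 1] - bptr[i];
            int unfilled = isptr[j + 1] - sptr[j];
            int nbits = min(unplaced, unfilled);
            for (int k = 0; k < nbits; ++k) {             /* bit copy from block i to slot j */
                int src = bptr[i] + k, dst = sptr[j] + k;
                if ((blks[src >> 5] >> (src & 31)) & 1u)
                    atomicOr(&slts[dst >> 5], 1u << (dst & 31));
            }
            bptr[i] += nbits;
            sptr[j] += nbits;
        }
    }
}

/* Launched once for the whole data set, e.g. with the configuration used for Table 3: */
/*   perecGroupKernel<<<(n_groups + 31) / 32, 32>>>(d_slts, d_blks, d_ibptr, d_isptr,  */
/*                                                  d_bptr, d_sptr, d_phi, n_split, n_groups); */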
Fig. 4. Code of using constant memory and texture memory.
5.2. Optimization of P-EREC on GPU

In addition to the algorithm optimization used on the CPU, some optimizations are applied on the GPU with CUDA in order to boost the execution efficiency of the CUDA kernels. Memory access, occupancy, data transmission and other optimization strategies are discussed as follows.

5.2.1. Memory access optimization

The GPU has several types of memory available, each with different characteristics and management needs. Global, local, and texture memory are plentiful but have higher memory latency, while register, shared, and constant memory are scarce but have lower memory latency. Balancing the slow-and-plentiful against the fast-and-small is a critical factor for a good GPU program [15]. Our CUDA kernel achieves only a small fraction of the potential speed of the underlying hardware, because global memory is implemented with dynamic random access memory (DRAM) and tends to have long access latencies (hundreds of clock cycles) and finite access bandwidth [13]. We should therefore move some of the data requests from global memory to other types of memory that can speed up data access.

Constant memory can be used for data that are read-only during a kernel execution. If every thread in a half-warp requests data from the same address in constant memory, the GPU generates only a single read request and subsequently broadcasts the data to each thread [20]. Note that our GPU provides only 64 kB of constant memory, so we cannot store a large quantity of data there. In our work, we take advantage of constant memory for the offset sequence dev_phi, which is specified identically for every group. In this way, each thread processes the same offset sequence, i.e., each thread in a half-warp reads the offset sequence from the same address; the GPU then issues only a single read request and broadcasts the data to the other threads, which achieves high access efficiency and contributes to the kernel performance. This makes the offset sequence an obvious candidate for constant memory. The offset data in constant memory are declared and copied at lines 1 and 4 in Fig. 4, where cudaMemcpyToSymbol() is used to copy the data from the CPU to the GPU. Fig. 5 shows the performance in this case when using constant memory instead of global memory.
Fig. 5. The performance of using constant and texture memory.
Like constant memory, texture memory is another variety of read-only memory, so in some situations it can provide higher effective bandwidth by reducing memory requests to off-chip DRAM. Specifically, texture caches are designed for graphics applications where memory access patterns exhibit a great deal of spatial locality [20]. The position data dev_ibptr (dev_isptr, resp.) marking the beginning of each block (slot, resp.) are read-only and constitute a large amount of data; these data can be moved from global memory to texture memory. Fig. 4 shows how texture memory is used in our CUDA implementation. First, the data are declared as texture references (lines 2 and 3). After allocating memory for the two buffers, the references are bound to the memory buffers using cudaBindTexture() (lines 5 and 6). At this point, the textures are completely set up. However, when reading from the textures in the kernel, the special function tex1Dfetch() must be used to instruct the GPU to route the requests through the texture unit and not through standard global memory (lines 7 and 8) [16,20]. The improvement in performance when using texture memory is shown in Fig. 5.
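Since Fig. 4 itself is not reproduced here, the following sketch shows the general pattern the text describes, using the legacy CUDA texture-reference API that was current for compute capability 2.0 devices. The identifiers, sizes and helper function are illustrative assumptions, not the authors' exact listing.

#define MAX_OFFSETS 1024                                    /* assumed capacity of the offset table */

__constant__ int dev_phi[MAX_OFFSETS];                      /* offset sequence in constant memory */
texture<int, 1, cudaReadModeElementType> tex_ibptr;         /* read-only block start positions */
texture<int, 1, cudaReadModeElementType> tex_isptr;         /* read-only slot start positions  */

void setupReadOnlyData(const int *h_phi, size_t phiBytes,
                       const int *d_ibptr, const int *d_isptr, size_t ptrBytes)
{
    cudaMemcpyToSymbol(dev_phi, h_phi, phiBytes);            /* copy CPU offsets to constant memory */
    cudaBindTexture(NULL, tex_ibptr, d_ibptr, ptrBytes);     /* bind device buffers to textures */
    cudaBindTexture(NULL, tex_isptr, d_isptr, ptrBytes);
}

__global__ void readExampleKernel(int *out, int idx)
{
    /* reads are routed through the texture unit rather than standard global memory */
    out[0] = tex1Dfetch(tex_ibptr, idx);
    out[1] = tex1Dfetch(tex_isptr, idx);
    out[2] = dev_phi[0];                                     /* broadcast read from constant memory */
}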
Fig. 6. The performance of various thread numbers.

Fig. 7. Code of using shared memory.
5.2.2. Block and thread allocation for high performance

Both the dimension and the size of the CUDA block and grid are important. Selecting a suitable dimension makes it convenient to map the parallel problem onto the CUDA architecture, but it has little effect on the performance. Size plays an important role in the processing performance: the memory access latency that can be hidden and the occupancy of the GPU depend on the number of active warps in each streaming multiprocessor (SM). In our work, the N VLBs are divided into n groups, which means that only n threads are needed to launch the parallel processing of the algorithm. The thread allocation determines not only the block size but also the grid size, because these resources are dynamically partitioned and assigned to support their execution. If each thread block contains 32 threads, the n threads are partitioned and assigned to n/32 blocks in the grid. We ran experiments with different thread allocations using the same implementation. Fig. 6 shows the speedup for different data sizes with various numbers of threads per thread block. The performance is almost the same whether each thread block consists of 64 or 32 threads. When the number of threads per block is reduced to 16, the performance improves noticeably, and it is best with 4 threads per block. On the one hand, using fewer threads per block still keeps multiple warps running per SM, so the GPU can hide both Arithmetic Logic Unit (ALU) and memory latencies and keep the execution units busy. On the other hand, in our work each thread processes one VLB group with a large quantity of data, involving many automatic variables other than arrays. Hence higher occupancy does not necessarily mean better performance, and lower occupancy means that more registers can be used per thread, since registers are another dynamically partitioned resource. More importantly, registers are a precious resource fast enough to attain peak GPU performance: the more data that come from registers, the higher the achievable performance. Having a few more registers per thread can also prevent register spilling and preserve high performance [19].
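For instance, with one thread per group, the configuration that worked best in these tests would be launched roughly as follows (a sketch with assumed names, reusing the illustrative kernel from Section 5.1):

dim3 block(4);                                    /* few threads per block: more registers per thread */
dim3 grid((n_groups + block.x - 1) / block.x);    /* enough blocks to cover all n_groups threads */
perecGroupKernel<<<grid, block>>>(d_slts, d_blks, d_ibptr, d_isptr,
                                  d_bptr, d_sptr, d_phi, n_split, n_groups);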
Fig. 8. Comparison of using shared memory.
5.2.3. Other optimizing methods

We also want to use shared memory to hold the portion of the global memory data that is frequently used in an execution phase of the kernel. Shared memory is an on-chip memory, and variables residing in it can be accessed at very high speed. However, shared memory is quite small, and we must avoid exceeding its capacity when loading data. Shared memory is allocated to thread blocks; all threads in a block can access the shared memory locations allocated to that block [19]. In our code, each thread computes all the data of one VLB group and every GPU block needs to process a large quantity of data, but shared memory is a scarce resource and our GPU provides only 48 kB of shared memory per SM. Therefore, shared memory can be used in our work when the number of threads in each block is small. Fortunately, 4 threads per block was found to give better performance in our tests. In this case, we load the location arrays of the blocks into shared memory, because these data are updated at each stage of the EREC and are therefore frequently accessed. The process of loading these data into shared memory is shown in Fig. 7. The data needed by each GPU block are first read into shared memory. A private version of each shared variable is created and used by each thread block during kernel execution. The scope of a shared variable is a thread block, so all threads in a block see the same version of the shared variable. Each block then accesses the data residing in shared memory at high speed. Fig. 8 shows the performance when using shared memory on top of the previous optimizations. The execution time is almost the same when N is relatively small, and it is better when N is larger.

In addition, asynchronous transfers and pinned memory were tried for the same purpose. The result shows that the performance does not improve, because data transmission accounts for only a small percentage of the entire processing time in our scheme.
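Fig. 7 itself is not reproduced here; the following sketch shows one way the described staging could look, with each thread's per-group position arrays copied into shared memory at the start of the kernel. The array bounds and names are illustrative assumptions, not the authors' exact code.

#define THREADS_PER_BLOCK 4
#define MAX_SPLIT 64                                     /* assumed upper bound on blocks per group */

__global__ void perecSharedKernel(unsigned int *slts, const unsigned int *blks,
                                  const int *ibptr, const int *isptr,
                                  const int *phi, int n_split, int n_groups)
{
    /* one private slice of the frequently updated position arrays per thread */
    __shared__ int s_bptr[THREADS_PER_BLOCK][MAX_SPLIT + 1];
    __shared__ int s_sptr[THREADS_PER_BLOCK][MAX_SPLIT + 1];
    int tid = threadIdx.x;
    int g = blockIdx.x * blockDim.x + tid;               /* one VLB group per thread */
    if (g >= n_groups) return;
    int base = g * n_split;
    for (int k = 0; k <= n_split; ++k) {                 /* stage-0 positions loaded from global memory */
        s_bptr[tid][k] = ibptr[base + k];
        s_sptr[tid][k] = isptr[base + k];
    }
    /* ... the EREC stages for this group then read and update s_bptr[tid] and   */
    /* s_sptr[tid] at shared-memory speed instead of going back to global memory */
}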
Fig. 9. The performance of two different groups.

Fig. 10. Speedup comparison of EREC on GPU and P-EREC on GPU (CPU, resp.).
5.3. Groups allocation

In the above tests, the N blocks are all split into 128 groups. Much better performance can be obtained if the blocks are divided into more groups. Fig. 9 shows the performance of the tested data when divided into 128 and 256 groups. A higher speedup can be obtained if the data are split into more groups, since a larger number of groups not only raises the degree of parallelism but also reduces the load of each parallel thread. In addition, the iteration count of the algorithm decreases significantly with more groups, which boosts the performance to a large extent. However, more groups introduce more redundant information, which reduces the compression efficiency. Hence a suitable n should be chosen according to the actual requirements: where real-time performance is critical, a larger n can be selected in order to gain a higher speedup, while where the network bandwidth is limited, a smaller n should be chosen. In practice, we select a large n for mass data.

5.4. Analysis on P-EREC

The performance, resilience and complexity of the P-EREC are discussed in detail below. When using the P-EREC, the decoder can recover the bits of the VLBs in each group simultaneously, but it must know the total bit length of each group, which becomes redundant information. Moreover, these data are important and must be transmitted robustly.

5.4.1. Speedup performance

As the previous sections state, the implementation of the original EREC on the GPU has poor efficiency. To obtain good performance, we introduced the P-EREC, which reduces the computational complexity compared to the EREC; the implementation efficiency of the P-EREC is better than that of the EREC even on the CPU. More importantly, the P-EREC can be processed effectively in a coarse-grained parallel manner on the GPU and avoids the additional communication overhead of the original parallel EREC. Moreover, more groups bring a higher speedup, since this not only raises the degree of parallelism but also reduces the load of each parallel thread, and the iteration count of the algorithm also decreases significantly with more groups. For these reasons we further optimized the P-EREC on the GPU to boost the performance by a large margin.

The speedups of the EREC on the GPU and of the P-EREC on the GPU (CPU, resp.) over the serial EREC are displayed in Fig. 10. The EREC on the GPU obtains only a small speedup. The P-EREC on the CPU obtains better performance, and more groups bring a higher speedup. The P-EREC on the GPU achieves the highest speedup in our experiments.
Fig. 11. The error resilience of P-EREC and EREC.
The parallel P-EREC runs 32–123 times faster than the serial EREC code when the data are separated into 128 groups, and 54–350 times faster when the data are divided into 256 groups. That is, the more groups the data are separated into, the better the performance the P-EREC can achieve on the GPU. However, the GPU-based P-EREC speedup still depends on the data distribution and size.

5.4.2. Error resilience

Compared to the EREC, the P-EREC increases the error resilience of the VLC bit-stream to random and burst errors, because it limits error propagation to within each VLB group, the groups being mutually independent. The scheme is examined in a simulation with a binary symmetric channel (BSC). Various error patterns are tested with BER = 1.0E−2 to 1.0E−5; for each error pattern, bit errors are generated randomly. Fig. 11 shows the fraction of blocks that are affected at various BERs (L = 1 024 506, N = 8192, n = 128). A block is deemed to be affected if one or more of its bits are received in error. The P-EREC significantly reduces the channel error propagation effects when BER < 0.1%. In this case, the percentage of affected blocks of the P-EREC decreases more rapidly than that of the EREC as the BER declines, and the proportion of affected blocks is about 1%–5% lower than for the EREC.
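As a sketch of this kind of test (our assumption of the setup; the paper states only that bit errors are generated randomly for a given BER), each transmitted bit can be flipped independently with probability equal to the BER:

#include <stdlib.h>

/* Flip each bit of a bit-packed buffer independently with probability `ber`,
   emulating a binary symmetric channel. */
void injectBscErrors(unsigned char *bitstream, long nbits, double ber)
{
    for (long k = 0; k < nbits; ++k)
        if ((double)rand() / (double)RAND_MAX < ber)
            bitstream[k >> 3] ^= (unsigned char)(1u << (k & 7));
}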
5.4.3. Complexity

It is shown in [18] that the complexity of the EREC algorithm is of the order of N log N for N blocks. The complexity of the P-EREC algorithm (if the N blocks are divided into n groups, where n is a constant) is therefore reduced to N(log N − log n), i.e., N log(N/n), since each of the n groups runs the EREC over only N/n blocks.

5.4.4. Redundancy

The P-EREC improves the performance on the GPU at the expense of slightly more redundant header information. Besides the values of N and T, the decoder must know the total bit length of each group, because each group recovers its blocks independently. That is, the method involves side information consisting of n length values, and these are important for decoding the data correctly.

6. Conclusions

In this paper, we proposed an improvement to the EREC which optimizes its performance with parallel computing, exploiting the enormous power of the GPU using NVIDIA CUDA technology. In order to obtain a higher speedup over the sequential EREC, the partitioning EREC (P-EREC) was introduced. It works by splitting the VLBs into groups; in each group, the blocks of data can be coded independently with the EREC. Owing to the reduced complexity, the implementation efficiency of the P-EREC is higher than that of the EREC even on the CPU. On this basis, the P-EREC was further optimized on the GPU with CUDA to boost the performance substantially. Constant memory, texture memory, shared memory, thread allocation for high performance and other optimization techniques were fully used. They speed up memory access, hide both ALU and memory latencies, and achieve higher bandwidth, all of which helped our CUDA code obtain better performance on the GPU.

The experimental results show that the GPU-based P-EREC speedup depends on the data size and distribution. When the VLBs in Table 1 are separated into 128 groups (256 groups, resp.), the parallel P-EREC gains a 32× to 123× (54× to 350×, resp.) speedup for different data sizes compared with the original C code of the EREC, and it can achieve a higher speedup with more groups. In brief, the parallel P-EREC overcomes the high time consumption of the sequential CPU code and can meet the requirements of real-time processing. Hence it can be used more widely in image and video compression schemes. It was also found that the P-EREC increases the resilience of the VLC bit-stream to random and burst errors: it reduces the channel error propagation effects when BER < 0.1%, and the percentage of affected blocks is about 1%–5% lower than for the original EREC (L = 1 024 506, N = 8192, n = 128), though it brings slightly more redundant information. In a word, it is economical and robust.

In this paper we focused on the implementation of the algorithm on the GPU. To compare parallelization approaches for the EREC, we are considering cluster or cloud computing in our next work. Moreover, the EREC is an iterative algorithm, which is difficult to parallelize. Fortunately, the idea of partitioning the EREC works well: after partitioning, it can be processed effectively in a coarse-grained parallel manner and achieves good performance. Therefore, the idea of partitioning can also be applied to other iterative algorithms or schemes in order to parallelize them effectively.

References

[1] I. Ahmad, Y. He, M.L. Liou, Video compression with parallel processing, Parallel Computing 28 (2002) 1039–1078.
[2] N. Anderson, J. Mache, W. Watson, Learning CUDA: lab exercises and experiences, in: SPLASH'10, 2010, pp. 183–187.
[3] P. Bakkum, K. Skadron, Accelerating SQL database operations on a GPU with CUDA, in: GPGPU-3, 2010, pp. 94–103.
[4] V. Boyer, D. El Baz, M. Elkihel, Solving knapsack problems on GPU, Computers & Operations Research 39 (2012) 42–47.
[5] T.T. Cao, K. Tang, A. Mohamed, T.S. Tan, Parallel banding algorithm to compute exact distance transform with the GPU, in: I3D, 2010, pp. 83–90.
[6] R. Chandramouli, N. Ranganathan, S.J. Ramadoss, Adaptive quantization and fast error resilient entropy coding for image transmission, IEEE Transactions on Circuits and Systems for Video Technology 8 (1998) 411–421.
[7] N.T. Cheng, N.G. Kingsbury, The ERPC: an efficient error-resilient technique for encoding positional information or sparse data, IEEE Transactions on Communications 40 (1992) 140–148.
[8] A. Colic, H. Kalva, B. Furht, Exploring NVIDIA-CUDA for video coding, in: Proceedings of the First Annual ACM SIGMM Conference on Multimedia Systems, 2010, pp. 13–22.
[9] C.D. Daniel, M. Dominik, S. Andreas, P. Sabine, S.F. Achilleas, Performance evaluation of image processing algorithms on the GPU, Journal of Structural Biology 164 (2008) 153–160.
[10] J.M. Elble, N.V. Sahinidis, P. Vouzis, GPU computing with Kaczmarz's and other iterative algorithms for linear systems, Parallel Computing 36 (2010) 215–231.
[11] Y. Fang, EREC-based length coding of variable-length data blocks, IEEE Transactions on Circuits and Systems for Video Technology 20 (2010) 1358–1366.
[12] J.S. Kim, J.D. Kim, K.Y. Lee, The efficient and robust error resilient entropy coding of compressed image for wireless communications, IEICE Transactions on Fundamentals 88 (2005) 1448–1454.
[13] D.B. Kirk, W.M.W. Hwu, Programming Massively Parallel Processors, 2010.
[14] A. Krishnamurthy, S. McMains, Accurate GPU-accelerated surface integrals for moment computation, Computer-Aided Design 43 (2011) 1284–1295.
[15] A.E. Nocentino, P.J. Rhodes, Optimizing memory access on GPUs using morton order indexing, in: ACMSE'10, 2010.
[16] NVIDIA, NVIDIA CUDA C Programming Guide, 2010.
[17] K.S. Oh, K. Jung, GPU implementation of neural networks, Pattern Recognition 37 (2004) 1311–1314.
[18] D.W. Redmill, N.G. Kingsbury, The EREC: an error resilient technique for coding variable-length blocks of data, IEEE Transactions on Image Processing 5 (1996) 565–574.
[19] R. Farber, CUDA Application Design and Development, 2010.
[20] J. Sanders, E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, first ed., Addison-Wesley Professional, 2010.
[21] X.H. Shi, C. Li, S.H. Wang, X. Wang, Computing prestack Kirchhoff time migration on general purpose GPU, Computers & Geosciences 37 (2011) 1702–1710.
[22] D. Song, L. Cao, C.W. Chen, Robust multiple description image coding over wireless networks based on wavelet tree coding, error resilient entropy coding, and error concealment, Journal of Visual Communication and Image Representation 19 (2008) 311–319.
[23] S.S. Veysi İşler, A parallel algorithm for UAV flight route planning on GPU, International Journal of Parallel Programming 39 (2011) 809–837.
[24] C.T. Yang, C.L. Huang, C.F. Lin, Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters, Computer Physics Communications 182 (2011) 266–269.
[25] H.F. Zheng, L. Yu, C.W. Chen, Robust video transmission based on multiple description scalable coding with EREC, Proceedings of the SPIE 5960 (2005) 1891–1902.
Yuan Dai received her M.S. degree from Northwest A&F University, Yangling, China, in 2007. She is a Ph.D. candidate at Northwest A&F University, where she is currently a lecturer. Her research interests are GPU applications and distributed and parallel computing.
Yong Fang received his B.Eng., M.Eng., and Ph.D. degrees from Xidian University, Xi'an, China, in 2000, 2003 and 2005, respectively, all in signal processing. In 2005, he was appointed as a Post-Doctoral Fellow for one year with Northwest Polytechnical University, Xi'an. From 2007 to 2008, he was with Hanyang University, Seoul, Korea, as a Research Professor. He is currently a full Professor with Northwest A&F University, Yangling, China. He has long experience in hardware development, e.g., FPGA-based video codec design, DSP-based video surveillance systems, and so on. His current research interests include distributed source coding, image/video coding, processing, and transmission.
Dongjian He received his M.S. and Ph.D. degrees from Northwest A&F University, Yangling, China, in 1985 and 1998, respectively. He is the dean and a Ph.D. adviser of the College of Mechanical and Electronic Engineering, Northwest A&F University, and the managing dean of the ''Yangling International Academy of Modern Agriculture''. His research focuses on intelligent detection and control, image analysis and recognition, and agricultural information technology.
Bormin Huang received his B.S. from National Taiwan University and his M.S.E. in aerospace engineering from the University of Michigan, Ann Arbor. He received his Ph.D. in the area of satellite remote sensing from the University of Wisconsin-Madison in 1998. He worked at the NASA Langley Research Center for 3 years. He is currently a research scientist at the Space Science and Engineering Center, University of Wisconsin-Madison. He has broad research interests in remote sensing data processing and coding. Since 2004 he has been the principal investigator of the NOAA satellite data compression project for the US next-generation ultraspectral sounders.